Transforming Data to Unlock Its Latent Value

At the heart of data analysis lies a need to understand the real-world entities represented in the data. Every data set we encounter is an attempt to capture a slice of our complex world and communicate some information about it in a way that has the potential to be informative to humans, machines, or both. Moving from basic analyses to advanced analytics requires the ability to imagine multiple ways of conceptualizing the composition of entities and the relationships present in our data. It also requires the realization that different levels of aggregation, disaggregation, and transformation can open up new pathways to understanding our data and identifying the valuable insights it contains.

In this talk, we’ll discuss several ways to think about the composition and representation of our data. We’ll also demonstrate a series of methods that leverage tools like networks, hierarchical aggregations, and unsupervised clustering to visually explore our data, transform it to discover new insights, help frame analytical problems and questions, and even improve machine learning model performance. In exploring these approaches, and with the help of Python libraries such as Pandas, Scikit-Learn, Seaborn, and Yellowbrick, we will provide a practical framework for thinking creatively and visually about your data and unlocking latent value and insights hidden deep beneath its surface.

Published in: Data & Analytics

  1. 1. TRANSFORMING DATA TO UNLOCK ITS LATENT VALUE PyData Carolinas 9/15/2016
  2. 2. How many of you consider yourselves data scientists?
  3. 3. How many of you spend a lot of time exploring data?
  4. 4. How many of you have a formal process for data exploration?
  5. 5. ABOUT ME - TONY OJEDA Founder of District Data Labs. Education and research company. Business & finance background. Self-taught programmer (R & Python).
  6. 6. HOW I THINK ABOUT DATA
  7. 7. PUT THINGS IN ORDER Categories, Classifications, Taxonomies, Ontologies.
  8. 8. EXPLORATION FRAMEWORK Prep Phase. Identify: Types of Information, Entities in Data Set. Review: Transformation Methods, Visualization Methods. Create: Category Aggregations, Continuous Bins, Cluster Categories. Explore Phase. Insights Over Time Visualization Filter + Aggregate Field Relationships Entity Relationships
  9. 9. THE DATA: EPA VEHICLE FUEL ECONOMY http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip
  10. 10. IDENTIFY Identify: Types of Information, Entities in Data Set. Review: Transformation Methods, Visualization Methods. Create: Category Aggregations, Continuous Bins, Cluster Categories.
  11. 11. IDENTIFY TYPES OF INFORMATION
  12. 12. IDENTIFY ENTITIES IN THE DATA
  13. 13. ENTITIES IN OUR DATA SET Level 4, Level 3, Level 2, Level 1: Year + Model; Year + Model Type; Year + Make; Year; Year + Vehicle Class; Vehicle Class; Model Type; Make.
  14. 14. ENTITIES IN OUR DATA SET Level 4, Level 3, Level 2, Level 1: 2016 Ford Mustang 2.3L V4 Automatic Rear-wheel Drive; 2016 Ford Mustang; 2016 Fords; All 2016 Vehicles; 2016 Subcompacts; All Subcompacts; Ford Mustangs; All Ford Vehicles.
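
The entity levels on slides 13 and 14 can be read as different grouping keys over the same rows. A minimal sketch with pandas, assuming the columns are named `year`, `make`, `model`, and `comb08` (combined MPG); the four rows are made up for illustration and are not taken from the EPA file:

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions about the EPA file.
vehicles = pd.DataFrame({
    "year": [2016, 2016, 2016, 1985],
    "make": ["Ford", "Ford", "BMW", "Ford"],
    "model": ["Mustang", "Focus", "X5", "Mustang"],
    "comb08": [25, 30, 21, 18],
})

# Each entity level is the same table grouped by a different set of keys.
by_year = vehicles.groupby("year")["comb08"].mean()                  # e.g. all 2016 vehicles
by_year_make = vehicles.groupby(["year", "make"])["comb08"].mean()   # e.g. 2016 Fords
by_make_model = vehicles.groupby(["make", "model"])["comb08"].mean() # e.g. Ford Mustangs

print(by_year_make)
```

Changing the key list is all it takes to move between levels, which is why it helps to enumerate the entities before exploring.
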
  15. 15. REVIEW Identify: Types of Information, Entities in Data Set. Review: Transformation Methods, Visualization Methods. Create: Category Aggregations, Continuous Bins, Cluster Categories.
  16. 16. TRANSFORMATION METHODS Filtering, Aggregation/Disaggregation, Pivoting, Graph Transformation
  17. 17. VISUALIZATION METHODS Bar charts, Multi-line Graphs, Scatter plots/matrices, Heatmaps, Network Visualizations
  18. 18. CREATE Identify: Types of Information, Entities in Data Set. Review: Transformation Methods, Visualization Methods. Create: Category Aggregations, Continuous Bins, Cluster Categories.
  19. 19. CATEGORY AGGREGATIONS Transmission: Automatic 3-spd, Automatic 4-spd, Manual 5-spd, Automatic (S5), Manual 6-spd, Automatic 5-spd, Auto(AM8), Auto(AV-S7), Automatic (S6), Automatic (S9), Manual 4-spd, + 33 more. Transmission Type: Automatic, Manual.
  20. 20. CATEGORY AGGREGATIONS Vehicle Class: Special Purpose Vehicle 2WD, Midsize Cars, Subcompact Cars, Compact Cars, Sport Utility Vehicle - 4WD, Small Sport Utility Vehicle 2WD, Small Sport Utility Vehicle 4WD, Two Seaters, Small Station Wagons, Minicompact Cars, Minivan - 4WD, + 23 more. Vehicle Category: Small Cars, Midsize Cars, Large Cars, Station Wagons, Pickup Trucks, Special Purpose, Sport Utility, Vans & Minivans.
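
The category aggregations on slides 19 and 20 amount to mapping many detailed labels onto a few broad ones. A minimal sketch, assuming the EPA columns are named `trany` and `VClass`; the prefix rule and the class-to-category dictionary are illustrative guesses, not the mapping used in the talk:

```python
import pandas as pd

# Hypothetical sample; "trany" and "VClass" are assumed column names.
vehicles = pd.DataFrame({
    "trany": ["Automatic 3-spd", "Manual 5-spd", "Auto(AM8)", "Automatic (S6)"],
    "VClass": ["Subcompact Cars", "Compact Cars",
               "Small Station Wagons", "Minivan - 4WD"],
})

# Rule-based aggregation: every label starting with "Auto" is an automatic.
vehicles["transmission_type"] = vehicles["trany"].map(
    lambda label: "Automatic" if label.startswith("Auto") else "Manual"
)

# Lookup-based aggregation: an explicit (partial, illustrative) mapping.
CLASS_TO_CATEGORY = {
    "Subcompact Cars": "Small Cars",
    "Compact Cars": "Small Cars",
    "Small Station Wagons": "Station Wagons",
    "Minivan - 4WD": "Vans & Minivans",
}
vehicles["vehicle_category"] = vehicles["VClass"].map(CLASS_TO_CATEGORY)

print(vehicles)
```

Labels missing from the dictionary come back as `NaN`, which makes unmapped classes easy to spot.
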
  21. 21. CATEGORIES FROM CONTINUOUS Very Low, Low, Moderate, High, Very High. Combined MPG → Fuel Efficiency Quintiles. Engine Displacement → Engine Size Quintiles. CO2 Emission → Emission Quintiles. Fuel Cost → Fuel Cost Quintiles.
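
One way to turn a continuous field into the five quintile labels from slide 21 is `pandas.qcut`. This is a sketch under assumptions: `comb08` is taken to be the combined-MPG column, and the ten values are invented so that each bin receives two rows:

```python
import pandas as pd

LABELS = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Hypothetical combined-MPG values; "comb08" is an assumed column name.
vehicles = pd.DataFrame({"comb08": [12, 15, 18, 20, 22, 24, 27, 30, 35, 50]})

# qcut cuts at quantiles, so each bin holds roughly the same number of rows.
vehicles["fuel_efficiency"] = pd.qcut(vehicles["comb08"], q=5, labels=LABELS)

print(vehicles["fuel_efficiency"].value_counts(sort=False))
```

The same call applies to engine displacement, CO2 emission, and fuel cost. On real data with many tied values, quantile edges can coincide and `qcut` will raise an error unless that case is handled.
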
  22. 22. CLUSTER CATEGORIES Takes multiple fields into consideration together. Groups things in ways you may not have thought of. Come up with descriptive names for clusters. Number of clusters? Looking for relatively clear boundaries. Automatically creates new categories (saves time).
  23. 23. VEHICLE CLUSTERS = 8
  24. 24. VEHICLE CLUSTERS = 4
  25. 25. ASSIGN DESCRIPTIVE NAMES Cluster 0 → Small Very Efficient. Cluster 1 → Large Inefficient. Cluster 2 → Midsized Balanced. Cluster 3 → Small Moderately Efficient.
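
The clustering and naming steps on slides 22 to 25 might look roughly like the following. This is a sketch, not the talk's code: it assumes k-means from Scikit-Learn on two standardized fields (`displ` for engine displacement and `comb08` for combined MPG, both assumed names), uses invented data, and assigns names by ranking clusters on mean MPG only so the example is reproducible. In practice the names would be chosen by reading the profile table.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical vehicles forming four well-separated groups.
df = pd.DataFrame({
    "displ":  [1.4, 1.5, 1.6, 1.9, 2.0, 2.1, 3.4, 3.5, 3.6, 5.9, 6.0, 6.1],
    "comb08": [39, 40, 41, 27, 28, 29, 21, 22, 23, 12, 13, 14],
})

# Standardize so both fields contribute comparably to the distances.
features = StandardScaler().fit_transform(df[["displ", "comb08"]])
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

# Profile each cluster; this table is what you would inspect to pick names.
profile = df.groupby("cluster")[["displ", "comb08"]].mean()

# Cluster ids are arbitrary, so names are attached after inspecting the profile.
# Illustrative rule: rank clusters from most to least efficient.
ordered_ids = profile.sort_values("comb08", ascending=False).index
names = dict(zip(ordered_ids, [
    "Small Very Efficient",
    "Small Moderately Efficient",
    "Midsized Balanced",
    "Large Inefficient",
]))
df["cluster_name"] = df["cluster"].map(names)

print(profile)
```

Trying several values of `n_clusters` and comparing how clean the group boundaries look mirrors the 8-versus-4 comparison on slides 23 and 24.
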
  26. 26. EXPLORE PHASE Insights Over Time Visualization Filter + Aggregate Field Relationships Entity Relationships
  27. 27. VEHICLE CATEGORY COUNTS (2016)
  28. 28. VEHICLE CATEGORY COUNTS (1985)
  29. 29. ENGINE SIZE COUNTS (2016)
  30. 30. FUEL EFFICIENCY COUNTS (2016)
  31. 31. VEHICLE CLUSTER COUNTS (2016)
  32. 32. MANUFACTURER VEHICLE COUNTS (2016)
  33. 33. MORE DETAIL
  34. 34. FUEL EFFICIENCY VS. ENGINE SIZE (2016)
  35. 35. FUEL EFFICIENCY VS. ENGINE SIZE (1985)
  36. 36. ENGINE SIZE & EFFICIENCY VS. CATEGORY
  37. 37. PIVOT COUNTS BY MAKE & CATEGORY
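
A pivot of counts by make and category, as on slide 37, can be sketched with `pandas.crosstab`. The column names `make` and `vehicle_category` and the eight rows are assumptions for illustration:

```python
import pandas as pd

# Hypothetical rows: one per vehicle model.
vehicles = pd.DataFrame({
    "make": ["BMW", "BMW", "BMW", "Jeep", "Jeep", "Ford", "Ford", "Ford"],
    "vehicle_category": ["Small Cars", "Small Cars", "Sport Utility",
                         "Sport Utility", "Sport Utility",
                         "Small Cars", "Pickup Trucks", "Sport Utility"],
})

# Rows become makes, columns become categories, cells hold model counts.
counts = pd.crosstab(vehicles["make"], vehicles["vehicle_category"])

print(counts)
```

A table in this shape can be passed straight to a heatmap function such as Seaborn's `heatmap` to show which makes specialize and which go for breadth.
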
  38. 38. CHANGES OVER TIME Insights Over Time Visualization Filter + Aggregate Field Relationships Entity Relationships
  39. 39. CATEGORIES OVER TIME
  40. 40. BMW OVER TIME
  41. 41. TOYOTA OVER TIME
  42. 42. EXPLORE PHASE Insights Over Time Visualization Filter + Aggregate Field Relationships Entity Relationships
  43. 43. SCATTER MATRICES & PLOTS
  44. 44. SCATTER MATRIX WITH CATEGORIES
  45. 45. ENGINE SIZE VS. EFFICIENCY
  46. 46. ENGINE SIZE VS. FUEL COST
  47. 47. EXPLORE PHASE Insights Over Time Visualization Filter + Aggregate Field Relationships Entity Relationships
  48. 48. GRAPH ANALYSIS Relationships between entities. Attributes entities have in common. Actions one entity takes involving another. Changes in relationships over time.
  49. 49. RELATIONAL TO GRAPH TRANSFORMATION
  50. 50. RELATIONAL TO GRAPH TRANSFORMATION
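
One plausible reading of the relational-to-graph transformation is to link two manufacturers whenever they share an attribute value and to weight the edge by how many values they share. The sketch below uses only the standard library; the rows, the choice of vehicle category as the shared attribute, and the weighting rule are all assumptions rather than the talk's actual method:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical relational rows: (make, vehicle category).
rows = [
    ("BMW", "Small Cars"), ("BMW", "Sport Utility"),
    ("Jeep", "Sport Utility"),
    ("Ford", "Small Cars"), ("Ford", "Sport Utility"), ("Ford", "Pickup Trucks"),
    ("Ram", "Pickup Trucks"),
]

# Step 1: collect the set of attribute values each make has.
attrs_by_make = defaultdict(set)
for make, category in rows:
    attrs_by_make[make].add(category)

# Step 2: connect every pair of makes that shares at least one value;
# the edge weight is the number of shared values.
edges = {}
for a, b in combinations(sorted(attrs_by_make), 2):
    shared = attrs_by_make[a] & attrs_by_make[b]
    if shared:
        edges[(a, b)] = len(shared)

for (a, b), weight in edges.items():
    print(a, b, weight)
```

In this toy data the make with the broadest range (Ford) ends up with the most connections, while specialists sit at the edge of the network. A weighted edge list like this can be loaded into a graph library such as NetworkX for ego graphs or community detection.
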
  51. 51. MANUFACTURER NETWORK (2016)
  52. 52. EGO GRAPH FOR NISSAN
  53. 53. COMMUNITY GRAPH (2016)
  54. 54. EDGE WEIGHTS OVER TIME
  55. 55. FILTER FOR SPECIFIC MAKES
  56. 56. RECAP
  57. 57. EXPLORATION FRAMEWORK Prep Phase. Identify: Types of Information, Entities in Data Set. Review: Transformation Methods, Visualization Methods. Create: Category Aggregations, Continuous Bins, Cluster Categories. Explore Phase. Insights Over Time Visualization Filter + Aggregate Field Relationships Entity Relationships
  58. 58. SO MANY INSIGHTS!
  59. 59. NO REALLY… SO MANY! Significantly more small cars than other types in 2016. Sport Utility Vehicles are currently next most popular. Midsize Cars currently third most popular. Other vehicle types currently not as popular. Sport Utility Vehicles didn't exist in 1985. Small Cars were even more popular in 1985. Pickup Trucks were also more popular in 1985. Special Purpose Vehicles were more popular as well. Most vehicles today have very small or moderate sized engines. Most vehicles today are very fuel efficient. Few vehicles today have large engines. Even fewer have low fuel efficiency. BMW currently makes the most vehicle models, followed by Chevy and Ford. Pagani and Alfa Romeo make the fewest vehicle models. Smaller cars with smaller engines are most fuel efficient. Currently no small engine vehicles with low efficiency. Even Vans and Station Wagons are relatively fuel efficient these days. Each vehicle category has varying engine sizes. BMW is doubling down on small cars. So is Porsche. Ford, Chevrolet, & Nissan are going for breadth. Jeep and Land Rover are focused solely on SUVs. Ram is focused solely on Pickup Trucks. A few companies are focused only on small cars. Toyota used to make a lot of small cars, but now makes fewer in favor of SUVs and Pickup Trucks. Overall surge in small efficient engine vehicles over last 10 years. Mostly at expense of moderate/large inefficient engines. Even Large Cars are relatively fuel efficient these days. Linear relationships between engine size and fuel cost and emissions. Exponential relationships between efficiency and engine size and fuel cost. Clustering into 4 groups results in relatively clear boundaries in the data. Manufacturers with both depth and breadth of vehicle attributes have most connections. Manufacturers that specialize are positioned toward edge of network. Four distinct communities detected and connections over time have converged.
  60. 60. THANK YOU! tojeda@districtdatalabs.com, linkedin.com/in/tonyojeda, @tonyojeda3, http://districtdatalabs.com, http://bit.ly/PyDataNC (code)
