Capabilities Brief Analytics



  1. DT Core Analytical Competencies

     Data Engineering
     • Data Architecture Design and Development
     • Large-Scale Enterprise Architecture and Design
     • Migrate, Extract, Transform, and Load Data
     • Spatial, Multi-Domain, and Cloud-Based Data Services

     Analytics – Quantitative
     • Data Transformation and Ingestion
     • Dissemination and Reporting Tools
     • Data Mining, Exploitation, and Correlation Tools
     • Spatial Data Mining and Geographic Knowledge Discovery

     Data Tactics Corporation Proprietary and Confidential Material
  2. DT Core Analytical Competencies

     The Team: graduates of top-tier universities including Stanford, Caltech, and MIT, with ties to these and local universities.

     Degrees include Mathematics, Computer Science, Aeronautical Engineering, Astrophysics, Electrical Engineering, Mechanical Engineering, Statistics, and Social Sciences.

     Competencies include data mining, machine learning, statistics, spatial statistics, Bayesian statistics, econometrics, computational geometry, spatial econometrics, applied mathematics, theoretical robotics, dynamic systems, and control theory.

     Foci include unsupervised cross-modal clustering algorithms, principal component analysis, independent component analysis, regression, spatial regression, geographically weighted regression, zeroth-order processing, nonlinear optimization, autoregressive models, time-series analysis, spatial regime models, and HAC models.

     Technical Competencies include
  3. Data Tactics Analytics Cell
  4. Analytics Competencies

     • Time Series Analytics (i): applying the ARIMA model in a parallelized environment to provide anomaly detection
     • Correlation Analytics (ii): brute-force pairwise Pearson's correlation over vectors in a cloud-backed engine
     • Aggregation Analytics (iii):
       – Aggregate micro-pathing: repurposing data to analyze and display movement patterns
       – Dwell-time calculations: an analytic to discover areas of interest based on movement activity
     • Graph Analytics (iv): discovering social interaction models and paradigms within network data

     [Figure: four panels (i)–(iv) illustrating each analytic]
  5. Analytics Competencies

     • Directional Spatio-Temporal Analytics (i): compare distributions with a focus on changes in the morphology of the distribution, and on the mobility of individual observations within the distribution, over the same period of time and space (Wy)
     • Local Classification (ii): non-self-similarities and self-similarities; within- and between-group correlations
     • Ecological Analytics (iii):
       – Regression modeling
       – Spatial regression
       – Spatial regime models
       – HAC models
  6. Data Tactics Data Repository
  7. Quantitative Data Competencies

     • Proxy problem definition – Different problems lead to different questions, which lead to different data sets. The acceptability of a data source is judged against the proxy problems it must support.
     • Key dimensions of variability – Key dimensions such as time, space, and identifier were targeted for collection. However, different proxy problems require different key dimensions.
     • Capturing scope – The following was explicitly captured:
       – Data structure (e.g., graph relationship data vs. graph transaction data vs. dimensional data)
       – Data timespan (if time is a dimension)
       – Data geospatial footprint (if geospatial is a dimension)
       – Data volume (both in total GB and in total number of rows)
       – Dataset overlap
     • Capturing opinions – Current star ratings are based on:
       – Data consistency, volume, and persistence
       – Data coverage (time and space)
       – Data precision (time and space)
       – Data "genuineness" (synthesized data is penalized)
       – Data distribution (i.e., we may have extremely precise geospatial data, but if there are only 40 unique geospatial points in the data, the geospatial aspects aren't that interesting)
       – Data dimensionality (higher dimensionality with reasonable distributions on each dimension is preferred)
  8. Quantitative Data Holdings

     Each data holding is catalogued with the following fields:
     • Name of the data source
     • Data source where the data was acquired
     • Description and notes on the data source
     • Location of the data on the FTP site
     • Data format
     • Collection start and end dates, if known
     • Date that statistics were last collected
     • Initial reviewer of the data
     • Opinion of data quality
     • Size of the data (storage space as well as rows)
     • Geospatial coverage
     • Data handling requirements
     • Collection information
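     The catalogue entry above can be modelled as a simple record type. A minimal sketch with illustrative field names — the brief lists the fields but does not give an actual schema:

     ```python
     from dataclasses import dataclass
     from typing import Optional

     @dataclass
     class DataHolding:
         """One catalogue entry for a quantitative data holding.

         Field names are illustrative, mirroring the slide's list;
         the brief does not specify the real schema."""
         name: str                                   # name of the data source
         source: str                                 # where the data was acquired
         description: str                            # description and notes
         data_format: str
         ftp_location: str                           # location on the FTP site
         collection_start: Optional[str] = None      # start date, if known
         collection_end: Optional[str] = None        # end date, if known
         stats_last_collected: Optional[str] = None
         initial_reviewer: Optional[str] = None
         size_gb: Optional[float] = None             # storage space
         row_count: Optional[int] = None             # total rows
         geospatial_coverage: Optional[str] = None
         handling_requirements: Optional[str] = None
         quality_stars: Optional[int] = None         # star rating (criteria above)
     ```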
  9. Quantitative Data Holdings

     • Armed Conflict Location and Events Dataset (ACLED)
     • AIS Ship Data
     • Atmospherics Reports
     • BrightKite Data
     • Classified Ads
     • CNN
     • Digital Terrain Elevation Data (DTED)
     • Enron Data
     • Epinions Data
     • EU Email
     • Facebook
     • Flickr Data
     • Flight Information Data
     • Four Square Data
     • Friend Feed Data
     • Geolife Data
     • Gowalla Data
     • International Conference on Weblogs and Social Media (ICWSM) Data
     • Identica Data
     • IMDB Data
     • ISVG
     • Knowledge Discovery and Data Mining (KDD) Tools Competition
     • KDD 2003 Data
     • KDD 2005 Data
     • Kiva Data
     • Landscan Data
     • LiveJournal Data
     • Meme Tracker
     • Meme Twitter TS
     • NFL Plays
     • Night Lights Data
     • Open Data Airtraffic Accidents
     • Open Street Maps
     • Panoramio Data
     • Patent Citations Data
     • Photobucket Data
     • Picasa Web Albums Data
     • Processed Employment Data
     • Scamper Data
     • Twitter
     • UNDP
     • Weather Data
     • Webgraphs
     • Youtube
  10. Quantitative Data Competencies

      • Panoramio / Flickr – Metadata on uploaded public photos provides excellent geospatial and temporal resolution, and also provides user information. An estimated 250 million rows of photo metadata, with over 150 million already gathered.
      • AIS – Ship-tracking data that provides ship 'pings' as they move. Precise time and geospatial information. 50 million records and counting.
      • OpenStreetMaps – Over 2 billion geospatial points from mapping enthusiasts' tracks across the world. Time and user-ID information also included.
      • Gowalla / Brightkite – About 11 million FourSquare-style check-ins with user, location, and time information.

      Example proxy problems:
      • Discovering "holes" in the data where photos are no longer taken, to detect avoided areas
      • Discovering relationships and links based on co-occurrence between users in time and space
      • Tracking and analyzing movement patterns on a local and global scale
      • Analyzing image data for changes in the same locations
      • Detecting differences in photo activity in an area over time
      • Detecting events based on abnormal photo-activity behavior
      • Mapping user IDs across data sources to create a unified analytic picture
      • Detecting the home range of each user
      • Defining patterns of life by routine activities and movement
      • Tracking language usage in areas to determine abnormal language presence in an area
      • Local vs. tourist movement analysis and extraction
      • Trending of location popularity

      UNCLASSIFIED
  11. Quantitative Data Competencies

      Twitter – Sampled, ongoing collection of social-media tweets with user ID and time. Some tweets even have precise location data, but this is not the norm. Collection pulls roughly 1–2 million tweets per day.

      Example proxy problems:
      • Discovery of crowd-sourced phenomena (e.g., people posting to beware of a certain neighborhood)
      • Discovery of correlated trends (e.g., finding that people posting about a certain topic in an area correlates with higher crime in that area)
      • Tracking sentiment on certain topics and issues
      • Tracking language usage in areas to determine abnormal language presence in an area
  12. Quantitative Data Competencies

      • How can we infer movement patterns from vast amounts of what appears to be just point data, collected in time and associated with an identifier (i.e., user ID, bank account, etc.)?
      • The technique is applicable to Twitter, FourSquare, and many other sources.

      [Figure: volume plot of photos binned by area on a log scale — Paris as seen from Flickr over all time]
  13. Quantitative Data Competencies

      Micro-pathing:
      1. Goal: to catch active movement between locations a small distance apart
      2. Typically two to around a dozen points chained together
      3. Located in a small area, but with a definite path through the area
      4. Sampled in rapid succession (less than X seconds between points)
      5. Thousands or millions of micro-paths combine into a full path to view

      [Figure: an example micro-path forming from photos taken seconds apart (e.g., 2012-08-15 12:34:59 → 12:35:25) for Persons A, B, and C along a common path pattern. Segments are ignored when the implied velocity is too fast or when the gap between points is too long (e.g., 120 seconds). Overlay thousands or millions of these tiny micro-paths together and you get…]
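      The chaining rules above can be sketched as follows. This is an illustrative reconstruction, not the production analytic: it assumes each record is a (timestamp-in-seconds, lat, lon) tuple for a single user, and it uses the 60-second gap and 80 km/h velocity cutoffs cited for the Paris example.

      ```python
      import math

      def haversine_km(lat1, lon1, lat2, lon2):
          """Great-circle distance between two points, in kilometres."""
          r = 6371.0
          p1, p2 = math.radians(lat1), math.radians(lat2)
          dp = math.radians(lat2 - lat1)
          dl = math.radians(lon2 - lon1)
          a = (math.sin(dp / 2) ** 2
               + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
          return 2 * r * math.asin(math.sqrt(a))

      def micro_paths(points, max_gap_s=60, max_kmh=80):
          """Chain one user's (timestamp_s, lat, lon) points into micro-paths,
          breaking the chain whenever the time gap or the implied velocity
          exceeds the cutoffs. Returns a list of chains of >= 2 points."""
          if not points:
              return []
          points = sorted(points)
          paths, current = [], [points[0]]
          for prev, cur in zip(points, points[1:]):
              dt = cur[0] - prev[0]
              dist = haversine_km(prev[1], prev[2], cur[1], cur[2])
              speed = dist / (dt / 3600) if dt > 0 else float("inf")
              if dt <= max_gap_s and speed <= max_kmh:
                  current.append(cur)        # segment accepted: extend chain
              else:
                  if len(current) >= 2:      # segment ignored: close chain
                      paths.append(current)
                  current = [cur]
          if len(current) >= 2:
              paths.append(current)
          return paths
      ```

      Aggregating the output over millions of users is what produces the overlay views on the following slides; the per-user chaining itself is embarrassingly parallel, which is why it suits a cloud-backed engine.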
  14. Quantitative Data Competencies

      [Figure: view of Paris using a 60-second segment timeout and an 80 km/h cutoff on Flickr data. Visible: the Arc de Triomphe, with an apparent typical approach pathway; the Louvre; Notre Dame, with typical approach and exit pathways (harder to see); the Place de la Concorde, typically approached from a southern direction; and a red strip that appears to be the line of sight to the Eiffel Tower.]
  15. Quantitative Data Competencies

      [Figure: aggregate micro-pathing on a world of photo metadata with no speed, time, or distance restrictions]
  16. Quantitative Data Competencies

      [Figure: AIS ship-tracking micro-path blanket with no time or space filters, showing Japan's south coast, China's coast with high levels of activity, and the coast of Taiwan]
  17. Quantitative Data Competencies

      [Figures: Flickr Paris changes, 2004 vs. 2005 and 2011 vs. 2010]

      Hh [HIGH, high] – an increase between Xt1 → Xt2 relative to the respective (Xt1, Xt2) reference distribution, where t1, t2 belong to T. HIGH reflects a strong increase of one's own values (dxi) at location i between t1 and t2 relative to the change of neighboring values (dy); high reflects a modest increase of dy relative to the values of dx. Neighbors are defined with the spatially lagged variable Wy as the eight nearest observations.

      lL [low, LOW] – a decrease between Xt1 → Xt2 relative to the respective (Xt1, Xt2) reference distribution, where t1, t2 belong to T. low reflects a modest decrease of one's own values (dxi) at location i between t1 and t2 relative to the change of neighboring values (dy); LOW reflects a strong decrease of the neighboring values. Neighbors are defined with the spatially lagged variable Wy as the eight nearest observations.
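      The neighbour comparison described above can be illustrated with a small sketch: compute each site's own change dx, form the spatially lagged neighbour change with a k-nearest-neighbour weights matrix (k = 8 on the slide), and label each site by how its own change compares with its neighbourhood's. The labelling rule here is one illustrative reading of the Hh / lL definitions, not the actual implementation.

      ```python
      import numpy as np

      def knn_lag(coords, values, k=8):
          """Row-standardized k-nearest-neighbour spatial lag: for each
          observation, the mean of `values` over its k nearest neighbours."""
          coords = np.asarray(coords, float)
          values = np.asarray(values, float)
          d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
          np.fill_diagonal(d, np.inf)           # exclude self from neighbours
          nn = np.argsort(d, axis=1)[:, :k]     # indices of k nearest sites
          return values[nn].mean(axis=1)

      def classify_change(coords, x_t1, x_t2, k=8):
          """Label each site by own change dx versus neighbour change lag(dx):
          'Hh' when both increase and own change outpaces the neighbourhood,
          'lL' when both decrease and the neighbourhood falls faster."""
          dx = np.asarray(x_t2, float) - np.asarray(x_t1, float)
          dy = knn_lag(coords, dx, k=k)         # neighbourhood change (Wy style)
          labels = []
          for own, nbr in zip(dx, dy):
              if own > 0 and nbr > 0:
                  labels.append("Hh" if own >= nbr else "hH")
              elif own < 0 and nbr < 0:
                  labels.append("lL" if own >= nbr else "Ll")
              else:
                  labels.append("mixed")
          return labels
      ```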
  18. Quantitative Data Competencies

      [Figure: number of distinct photographers in Paris by day of year. New Year provides lots of photos; Bastille Day stands out; recurrent red strips show the recurring weekend.]
  19. Quantitative Data Competencies

      [Figure: number of distinct photographers in Caracas by day of year. The five-day Carnival celebration stands out, along with some interesting dates of low-volume activity.]
  20. Quantitative Data Competencies

      Airline flight data anomaly detection: during an unusual event, such as the winter storm shown below, the ARIMA model still follows the pattern but doesn't match as well. The areas where the red and black curves don't match are where unusual events have occurred.

      [Figure: plot of the count of points where the difference between the expected number of flights leaving an airport, based on the model, and the actual observed number of flights was statistically significant]
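      A minimal version of this idea can be sketched with a plain autoregressive fit. The slide's analytic uses ARIMA in a parallelized environment; this stand-in fits AR(p) by least squares with NumPy and flags points whose one-step residual exceeds z standard deviations.

      ```python
      import numpy as np

      def ar_anomalies(series, p=2, z=3.0):
          """Fit an AR(p) model by least squares and return the indices
          whose one-step-ahead residual exceeds z standard deviations.
          A simplified stand-in for the ARIMA approach described above."""
          y = np.asarray(series, float)
          n = len(y)
          # Design matrix: the row for time t holds [1, y[t-1], ..., y[t-p]]
          X = np.column_stack([np.ones(n - p)] +
                              [y[p - k:n - k] for k in range(1, p + 1)])
          target = y[p:]
          coef, *_ = np.linalg.lstsq(X, target, rcond=None)
          resid = target - X @ coef
          sigma = resid.std()
          return [p + i for i, r in enumerate(resid) if abs(r) > z * sigma]
      ```

      On a clean periodic series (e.g., daily departure counts), an injected spike shows up as a flagged index, mirroring the "red vs. black" divergence described above.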
  21. Quantitative Data Competencies

      Raw data file – each line is a comma-separated list of values:
        key1, timestamp, value
        key2, timestamp, value
        …

      A cloud-backed transformation produces a vector file, where each line has a key and a comma-separated list of values:
        Key1: 2.4, 3.4, 0.99, …
        Key2: 3.4, 4.3, 1.0, 0.6, …

      The correlation analytic then calculates, for each vector, the correlation to every other vector, using a Pearson correlation. Implemented in:
      • Python (RAM)
      • Hive
      • Mahout
      • Spark
      • Giraph
      • Cascalog

      Example output matrix:

               key1    key2    key3    key4
        key1    –      0.93    0.43    0.001
        key2    –       –     -0.5    -0.03
        key3    –       –       –      0.32
        key4    –       –       –       –
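      The in-memory (Python/RAM) variant of this brute-force analytic can be sketched in a few lines with NumPy; the key names are illustrative and file parsing is simplified away:

      ```python
      import numpy as np

      def pairwise_pearson(vectors):
          """Brute-force pairwise Pearson correlation over a dict mapping
          key -> list of floats. Returns {(key_a, key_b): r} for every
          unordered pair, i.e. the upper triangle of the matrix above."""
          keys = sorted(vectors)
          m = np.asarray([vectors[k] for k in keys], float)
          r = np.corrcoef(m)          # n_keys x n_keys correlation matrix
          return {(keys[i], keys[j]): r[i, j]
                  for i in range(len(keys))
                  for j in range(i + 1, len(keys))}
      ```

      This is O(n²) in the number of vectors, which is exactly what motivates the approximation engine on the next slide.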
  22. Quantitative Data Competencies

      Approximation engine for the O(n²) correlation-matrix problem, built on Spark with separate training and test engines. The technique is based on Google Correlate, and the approximation provides orders of magnitude of speedup compared to equivalent brute-force methods. It works best for highly correlated items, and uses a series of data projections, unsupervised learning, and vector quantization to provide dimensionality reduction for incoming complex vectors.
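      The projection step can be illustrated with a minimal sketch: z-score each vector so that Pearson correlation becomes a plain dot product, then apply a Johnson–Lindenstrauss random projection so each query costs O(dim) instead of O(n) per comparison. This is only a toy stand-in for the engine described above, which additionally uses unsupervised learning and vector quantization; function and parameter names are assumptions.

      ```python
      import numpy as np

      def approx_correlations(matrix, target_idx, dim=256, seed=0):
          """Approximate the Pearson correlation of row `target_idx`
          against every row. Rows are z-scored and scaled to unit norm,
          making correlation a dot product, which a random projection
          approximately preserves (Johnson-Lindenstrauss)."""
          x = np.asarray(matrix, float)
          z = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
          z /= np.sqrt(x.shape[1])                # unit-norm rows: corr == dot
          rng = np.random.default_rng(seed)
          proj = rng.standard_normal((x.shape[1], dim)) / np.sqrt(dim)
          p = z @ proj                            # reduced-dimension sketches
          return p @ p[target_idx]                # approximate correlation vector
      ```

      As the slide notes, this style of approximation is most reliable for highly correlated items: projection error is a fixed additive noise, so near-±1 correlations survive it while weak correlations can be swamped.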