Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Use of standards and related issues in predictive analytics

1,443 views

Published on

My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html

Published in: Technology

Use of standards and related issues in predictive analytics

  1. 1. Use of standards and related issues in predictive analytics KDD 2016, SF 2016-08-16 Paco Nathan, @pacoid
 Dir, Learning Group @ O’Reilly Media
  2. 2. PMML referenced by 86 publications in Safari, 2001-2016
 https://www.safaribooksonline.com/search/?query=PMML
  3. 3. Pattern: PMML for Cascading and Hadoop
 P Nathan, G Kathalagiri (2013-08-11)
 https://goo.gl/jk7829
  4. 4. Customer Orders Classify Scored Orders GroupBy token Count PMML Model M R Failure Traps Assert Confusion Matrix Pattern – score a model, using pre-defined Cascading app cascading.org/projects/pattern
  5. 5. evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms Algorithms and developer-centric template thinking only go so far in real-world workflows… Results shown in blue, hard problems highlighted in red Generalized Workflow for ML Use Cases in Big Data
  6. 6. Portable Format for Analytics (PFA) PFA updates the standards w.r.t. more contemporary issues of system architectures used for predictive analytics: distributed processing, in-memory computing, serialization, etc. http://dmg.org/pfa/docs/motivation/ • much more support for distributed systems • Avro data types • forward-looking toward more streaming applications • fits well with higher layers of abstraction, success of DSLs, etc.
  7. 7. Tuning Spark Streaming for Throughput Gerard Maas, Virdata (2014-12-22) “One Size Fits All” Doesn’t Anymore
 This common architectural pattern requires interchange…
  8. 8. bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine- and-then-uses-sensors-to-listen-to-it/ IoT alters “velocity” and “volume” dramatically
 This growing category of use cases requires interchange…
  9. 9. Lessons from the success of Apache Spark… interchange is necessary for the ecosystem major use cases tend to build their own ML libraries – despite a case where a majority of committers tend to support a common vision and encourage use of a canonical library (MLLib with DataFrames) when a successful business grows over time, challenges arise by definition: managing separated teams, mergers and acquisitions, increased audits, regulations, etc. therefore, lack of interchange for analytics represents a serious technical debt and potential liability
  10. 10. Tungsten Execution PythonSQL R Streaming DataFrame Advanced Analytics Physical Execution: CPU Efficient Data Structures Keep data closure to CPU cache Tungsten Lessons from the success of Apache Spark… direct use of “compilers” becomes atypical as abstraction layers become smarter for deferred optimization
  11. 11. What to suggest for existing standards? microservices: how to compose models + parameters from multiple/distinct services support for API definitions in Swaggar http://swagger.io/ consider the benefits of Parquet, e.g., how pushdown predicates enable better optimization of workflows
  12. 12. What to suggest for existing standards? additional standards emerging for other aspects of workflow definition: Jupyter http://jupyter.org/
 
 create and share documents that contain live code, equations, visualizations and explanatory text — 
 a network protocol suite, at heart, for distributed REPL environments, often along with containerization see usage in Oriole http://oreilly.com/oriole/index.html
 Dat http://dat-data.com/ shares versioned data through a decentralized network
  13. 13. What to suggest for existing standards? other lingering issues: • data lineage / provenance • metadata drift • public dialog and law:
 https://public.resource.org/about/
  14. 14. presenter: Just Enough Math O’Reilly (2014) justenoughmath.com monthly newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/

×