Fighting Fraud in Medicare
with Apache Spark
Miklos Christine
Solutions Architect
mwc@databricks.com, @Miklos_C
About Me: Miklos Christine
Solutions Architect @ Databricks
- Assist customers in architecting big data platforms
- Help customers understand big data best practices
Previously:
- Systems Engineer @ Cloudera
- Supported customers running a few of the largest clusters in the world
- Software Engineer @ Cisco
Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• Contributed 75% of the Spark code added in 2014
• Created Databricks on top of Spark to make big data simple
Next Generation Big Data Processing Engine
• Started as a research project at UC Berkeley in 2009
• ~600,000 lines of code (75% Scala)
• Latest release: Spark 1.6 (December 2015)
• Next release: Spark 2.0
• Open source license (Apache 2.0)
• Built by 1,000+ developers from 200+ companies
Apache Spark Engine
Unified engine across diverse workloads and environments:
• Spark Core, with standard libraries on top: Spark Streaming, Spark SQL, SparkML/MLlib, GraphFrames/GraphX
• Scale-out and fault tolerant
• Python, Java, Scala, and R APIs
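As a quick illustration of the unified engine, here is a minimal PySpark sketch (the tiny claims dataset is made up for illustration) running the same data through both the DataFrame API and Spark SQL; the streaming, ML, and graph libraries operate over these same abstractions.

```python
# Minimal sketch; the claims data is illustrative, not from the talk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("engine-demo").getOrCreate()

claims = spark.createDataFrame(
    [("P001", 1200.0), ("P002", 80.0), ("P001", 950.0)],
    ["provider_id", "amount"],
)

# DataFrame API: expression-based aggregation.
claims.groupBy("provider_id").sum("amount").show()

# Spark SQL: the same query through the same engine and optimizer.
claims.createOrReplaceTempView("claims")
spark.sql(
    "SELECT provider_id, SUM(amount) AS total FROM claims GROUP BY provider_id"
).show()
```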
History of Spark APIs
RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)
DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations
Dataset (2015)
• Internally rows, externally JVM objects
• Almost the "best of both worlds": type safe + fast
• But slower than DataFrames, and not as good for interactive analysis, especially in Python
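A hedged sketch contrasting the two older APIs in PySpark; the (name, age) records are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-history").getOrCreate()
sc = spark.sparkContext

# RDD (2011): a distributed collection of objects, driven by
# functional operators such as map and filter.
rdd = sc.parallelize([("alice", 34), ("bob", 19), ("carol", 42)])
names_rdd = rdd.filter(lambda r: r[1] >= 21).map(lambda r: r[0])
print(names_rdd.collect())                     # ['alice', 'carol']

# DataFrame (2013): a distributed collection of Row objects; these
# expression-based operations go through the logical plan and optimizer.
df = spark.createDataFrame(rdd, ["name", "age"])
df.filter(df.age >= 21).select("name").show()
```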
Apache Spark 2.0 API
Dataset (2016): DataFrame = Dataset[Row]
• DataFrame (untyped API): convenient for interactive analysis; faster
• Dataset (typed API): optimized for data engineering; fast
Benefit of the Logical Plan:
Performance Parity Across Languages
[Chart: the same job run through the RDD API vs. the DataFrame API in each language; DataFrame performance is roughly equal across languages, while RDD performance varies widely.]
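A minimal PySpark sketch of why this parity holds (data made up for illustration): the Catalyst optimizer sees only the logical plan, not the language that built it, so an equivalent Scala, Java, or R query compiles to the same physical plan; explain() prints those plans.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plans").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Prints the parsed, analyzed, optimized, and physical plans.
df.groupBy("key").sum("value").explain(True)
```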
Machine Learning with Apache Spark
Why do Machine Learning?
• Machine learning uses computers and algorithms to recognize patterns in data
• Businesses have to adapt to change faster
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
From Descriptive to Predictive to Prescriptive
Data Science Time
Iterate on Your Models
Spark ML
Why Spark ML
Provides general-purpose ML algorithms on top of Spark:
• Lets Spark handle the distribution of data and queries (scalability)
• Leverages Spark's improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
SparkML
ML Pipelines provide:
• Integration with DataFrames
• Familiar API based on scikit-learn
• Easy workflow inspection
• Simple parameter tuning (see the sketch after this list)
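A minimal sketch of such a pipeline with grid-search tuning; the toy (text, label) training set, the stage choices, and the parameter values are illustrative, not from the talk.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

# Made-up training data for the sketch.
train = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop map reduce", 0.0),
     ("spark ml pipelines", 1.0), ("legacy batch jobs", 0.0),
     ("spark sql dataframes", 1.0), ("old mapreduce code", 0.0)],
    ["text", "label"],
)

# scikit-learn-style stages, each transforming the DataFrame.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, tf, lr])

# Simple parameter tuning: grid search with cross-validation.
grid = (ParamGridBuilder()
        .addGrid(tf.numFeatures, [1000, 10000])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=2)
model = cv.fit(train)                 # best model over the grid
print(model.bestModel.stages)         # inspect the fitted workflow
```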
Databricks & SparkML
• Use DataFrames to directly access data (SQL, raw files)
• Extract, Transform and Load Data using an elastic cluster
• Create the model using all of the data
• Iterate many times on the model
• Deploy the same model to production using the same code
• Repeat
Advantages for Spark ML
• Data can be accessed directly through the Spark Data Sources API (no more endless hours copying data between systems); see the sketch after this list
• Data scientists can use all of the data rather than subsamples, and take advantage of the law of large numbers to improve model accuracy
• Data scientists can scale compute with the data size and model complexity
• Data scientists can iterate more, giving them the opportunity to create better models and to test and release more frequently
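A hedged sketch of that direct access: the paths, table name, and JDBC coordinates below are placeholders, not from the talk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-sources").getOrCreate()

claims = spark.read.parquet("/data/medicare/claims.parquet")   # columnar files
providers = spark.read.json("/data/medicare/providers.json")   # semi-structured
codes = (spark.read.format("jdbc")                             # external database
         .option("url", "jdbc:postgresql://dbhost:5432/ref")
         .option("dbtable", "provider_codes")
         .load())

# All three are DataFrames: join them and feed the result straight
# into Spark ML, with no intermediate copies between systems.
joined = claims.join(providers, "provider_id")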
SparkML - Tips
• Understand Spark partitions
  • Use the Parquet file format and compact small files
  • coalesce() / repartition()
• Leverage existing functions / UDFs
• Leverage DataFrames and SparkML
• Iterative algorithms: more cores for faster processing
(A sketch of the partitioning and file-format tips follows.)
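A short PySpark sketch of the partitioning, Parquet, and iteration tips; the paths and partition counts are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips").getOrCreate()

df = spark.read.json("/data/raw/events.json")

# Inspect partitioning: too many tiny partitions means scheduler
# overhead, too few means idle cores.
print(df.rdd.getNumPartitions())
df = df.repartition(64)          # full shuffle to a target count
# df = df.coalesce(8)            # narrow merge (no shuffle), reduce only

# Compact to Parquet: columnar, compressed, fast to scan on re-reads.
df.write.mode("overwrite").parquet("/data/clean/events.parquet")

# Cache before iterative algorithms so each pass reads from memory
# instead of re-reading the source.
train = spark.read.parquet("/data/clean/events.parquet").cache()
```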
What's New in Spark 2.0
Spark 2.0 - SparkML
• The RDD-based MLlib API is deprecated and in maintenance mode; the DataFrame-based SparkML API is now primary
• New algorithm support: bisecting k-means clustering, Gaussian mixture model, MaxAbsScaler feature transformer
• PySpark updates: LDA, Gaussian mixture model, generalized linear regression
• Model persistence across languages (see the sketch below)
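A minimal sketch of cross-language persistence, assuming the PySpark GaussianMixture estimator added in 2.0; the data and save path are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import GaussianMixture, GaussianMixtureModel

spark = SparkSession.builder.appName("persist").getOrCreate()

# Made-up two-cluster data for the sketch.
data = spark.createDataFrame(
    [(Vectors.dense(x),) for x in ([0.1, 0.2], [0.2, 0.1],
                                   [9.8, 9.9], [10.1, 9.7])],
    ["features"],
)

model = GaussianMixture(k=2, seed=42).fit(data)

# The saved directory is language-neutral: a model fit in Python can be
# loaded from Scala (GaussianMixtureModel.load(path)) and vice versa.
model.write().overwrite().save("/models/gmm")
reloaded = GaussianMixtureModel.load("/models/gmm")
```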
Spark Demo
Thanks!
Sign Up For Databricks Community Edition!
https://databricks.com/try-databricks
Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The two links above are part of the Databricks Guide, which contains many more examples and references.
References
• Apache Spark MLlib User Guide: contains code snippets for almost all algorithms, as well as links to the API documentation.
• Meng et al. "MLlib: Machine Learning in Apache Spark." 2015. http://arxiv.org/abs/1505.06807
