Fighting Fraud in Medicare
with Apache Spark
Miklos Christine
Solutions Architect
mwc@databricks.com, @Miklos_C
About Me: Miklos Christine
Solutions Architect @ Databricks
- Assist customers in architecting big data platforms
- Help customers understand big data best practices
Previously:
- Systems Engineer @ Cloudera
- Supported customers running a few of the largest clusters in the world
- Software Engineer @ Cisco
Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• Contributed 75% of the Spark code added in 2014
• Created Databricks on top of Spark to make big data simple
Next Generation Big Data Processing Engine
• Started as a research project at UC Berkeley in 2009
• ~600,000 lines of code (75% Scala)
• Latest release: Spark 1.6 (December 2015)
• Next release: Spark 2.0
• Open source license (Apache 2.0)
• Built by 1,000+ developers from 200+ companies
Apache Spark Engine
Unified engine across diverse workloads and environments:
• Spark Core, with standard libraries on top: Spark Streaming, Spark SQL, SparkML/MLlib, GraphFrames/GraphX
• Scale-out and fault tolerant
• Python, Java, Scala, and R APIs
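As a quick illustration of the unified engine, here is a minimal PySpark sketch (the tiny claims dataset is made up for illustration) running the same data through both the DataFrame API and Spark SQL; the streaming, ML, and graph libraries operate over these same abstractions.

```python
# Minimal sketch; the claims data is illustrative, not from the talk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("engine-demo").getOrCreate()

claims = spark.createDataFrame(
    [("P001", 1200.0), ("P002", 80.0), ("P001", 950.0)],
    ["provider_id", "amount"],
)

# DataFrame API: expression-based aggregation.
claims.groupBy("provider_id").sum("amount").show()

# Spark SQL: the same query through the same engine and optimizer.
claims.createOrReplaceTempView("claims")
spark.sql(
    "SELECT provider_id, SUM(amount) AS total FROM claims GROUP BY provider_id"
).show()
```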
History of Spark APIs
RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)
DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations
Dataset (2015)
• Internally rows, externally JVM objects
• Almost the "best of both worlds": type safe + fast
• But slower than DataFrames, and not as good for interactive analysis, especially in Python
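A hedged sketch contrasting the two older APIs in PySpark; the (name, age) records are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-history").getOrCreate()
sc = spark.sparkContext

# RDD (2011): a distributed collection of objects, driven by
# functional operators such as map and filter.
rdd = sc.parallelize([("alice", 34), ("bob", 19), ("carol", 42)])
names_rdd = rdd.filter(lambda r: r[1] >= 21).map(lambda r: r[0])
print(names_rdd.collect())                     # ['alice', 'carol']

# DataFrame (2013): a distributed collection of Row objects; these
# expression-based operations go through the logical plan and optimizer.
df = spark.createDataFrame(rdd, ["name", "age"])
df.filter(df.age >= 21).select("name").show()
```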
Apache Spark 2.0 API
Dataset (2016): DataFrame = Dataset[Row]
• DataFrame (untyped API): convenient for interactive analysis; faster
• Dataset (typed API): optimized for data engineering; fast
Benefit of the Logical Plan:
Performance Parity Across Languages
[Chart: the same job run through the RDD API vs. the DataFrame API in each language; DataFrame performance is roughly equal across languages, while RDD performance varies widely.]
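A minimal PySpark sketch of why this parity holds (data made up for illustration): the Catalyst optimizer sees only the logical plan, not the language that built it, so an equivalent Scala, Java, or R query compiles to the same physical plan; explain() prints those plans.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plans").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Prints the parsed, analyzed, optimized, and physical plans.
df.groupBy("key").sum("value").explain(True)
```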
Machine Learning with Apache Spark
Why do Machine Learning?
• Machine learning uses computers and algorithms to recognize patterns in data
• Businesses have to adapt to change faster
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
From Descriptive to Predictive to Prescriptive
Data Science Time
Iterate on Your Models
Spark ML
Why Spark ML
Provides general-purpose ML algorithms on top of Spark:
• Lets Spark handle the distribution of data and queries (scalability)
• Leverages Spark's improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
SparkML
ML Pipelines provide:
• Integration with DataFrames
• Familiar API based on scikit-learn
• Easy workflow inspection
• Simple parameter tuning (see the sketch after this list)
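A minimal sketch of such a pipeline with grid-search tuning; the toy (text, label) training set, the stage choices, and the parameter values are illustrative, not from the talk.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

# Made-up training data for the sketch.
train = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop map reduce", 0.0),
     ("spark ml pipelines", 1.0), ("legacy batch jobs", 0.0),
     ("spark sql dataframes", 1.0), ("old mapreduce code", 0.0)],
    ["text", "label"],
)

# scikit-learn-style stages, each transforming the DataFrame.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, tf, lr])

# Simple parameter tuning: grid search with cross-validation.
grid = (ParamGridBuilder()
        .addGrid(tf.numFeatures, [1000, 10000])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=2)
model = cv.fit(train)                 # best model over the grid
print(model.bestModel.stages)         # inspect the fitted workflow
```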
Databricks & SparkML
• Use DataFrames to directly access data (SQL, raw files)
• Extract, Transform and Load Data using an elastic cluster
• Create the model using all of the data
• Iterate many times on the model
• Deploy the same model to production using the same code
• Repeat
Advantages for Spark ML
• Data can be accessed directly through the Spark Data Sources API (no more endless hours copying data between systems); see the sketch after this list
• Data scientists can use all of the data rather than subsamples, and take advantage of the law of large numbers to improve model accuracy
• Data scientists can scale compute with the data size and model complexity
• Data scientists can iterate more, giving them the opportunity to create better models and to test and release more frequently
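A hedged sketch of that direct access: the paths, table name, and JDBC coordinates below are placeholders, not from the talk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-sources").getOrCreate()

claims = spark.read.parquet("/data/medicare/claims.parquet")   # columnar files
providers = spark.read.json("/data/medicare/providers.json")   # semi-structured
codes = (spark.read.format("jdbc")                             # external database
         .option("url", "jdbc:postgresql://dbhost:5432/ref")
         .option("dbtable", "provider_codes")
         .load())

# All three are DataFrames: join them and feed the result straight
# into Spark ML, with no intermediate copies between systems.
joined = claims.join(providers, "provider_id")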
SparkML - Tips
• Understand Spark partitions
  • Use the Parquet file format and compact small files
  • coalesce() / repartition()
• Leverage existing functions / UDFs
• Leverage DataFrames and SparkML
• Iterative algorithms: more cores for faster processing
(A sketch of the partitioning and file-format tips follows.)
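A short PySpark sketch of the partitioning, Parquet, and iteration tips; the paths and partition counts are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips").getOrCreate()

df = spark.read.json("/data/raw/events.json")

# Inspect partitioning: too many tiny partitions means scheduler
# overhead, too few means idle cores.
print(df.rdd.getNumPartitions())
df = df.repartition(64)          # full shuffle to a target count
# df = df.coalesce(8)            # narrow merge (no shuffle), reduce only

# Compact to Parquet: columnar, compressed, fast to scan on re-reads.
df.write.mode("overwrite").parquet("/data/clean/events.parquet")

# Cache before iterative algorithms so each pass reads from memory
# instead of re-reading the source.
train = spark.read.parquet("/data/clean/events.parquet").cache()
```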
What's New in Spark 2.0
Spark 2.0 - SparkML
• The RDD-based MLlib API is deprecated and in maintenance mode; the DataFrame-based SparkML API is now primary
• New algorithm support: bisecting k-means clustering, Gaussian mixture model, MaxAbsScaler feature transformer
• PySpark updates: LDA, Gaussian mixture model, generalized linear regression
• Model persistence across languages (see the sketch below)
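A minimal sketch of cross-language persistence, assuming the PySpark GaussianMixture estimator added in 2.0; the data and save path are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import GaussianMixture, GaussianMixtureModel

spark = SparkSession.builder.appName("persist").getOrCreate()

# Made-up two-cluster data for the sketch.
data = spark.createDataFrame(
    [(Vectors.dense(x),) for x in ([0.1, 0.2], [0.2, 0.1],
                                   [9.8, 9.9], [10.1, 9.7])],
    ["features"],
)

model = GaussianMixture(k=2, seed=42).fit(data)

# The saved directory is language-neutral: a model fit in Python can be
# loaded from Scala (GaussianMixtureModel.load(path)) and vice versa.
model.write().overwrite().save("/models/gmm")
reloaded = GaussianMixtureModel.load("/models/gmm")
```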
Spark Demo
Thanks!
Sign Up For Databricks Community Edition!
https://databricks.com/try-databricks
Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The two links above are part of the Databricks Guide, which contains many more examples and references.
References
• Apache Spark MLlib User Guide: contains code snippets for almost all algorithms, as well as links to the API documentation.
• Meng et al. "MLlib: Machine Learning in Apache Spark." 2015. http://arxiv.org/abs/1505.06807
