1. Fighting Fraud in Medicare with Apache Spark
Miklos Christine
Solutions Architect
mwc@databricks.com, @Miklos_C
2. About Me: Miklos Christine
Solutions Architect @ Databricks
- Assist customers in architecting big data platforms
- Help customers understand big data best practices
Previously:
- Systems Engineer @ Cloudera
- Supported customers running a few of the largest clusters in the world
- Software Engineer @ Cisco
3. Databricks, the company behind Apache Spark
Founded by the creators of Apache Spark in 2013
75% — share of Spark code contributed by Databricks in 2014
Created Databricks on top of Spark to make big data simple.
9. • Started as a research project at UC Berkeley in 2009
• 600,000 lines of code (75% Scala)
• Latest release: Spark 1.6 (December 2015)
• Next release: Spark 2.0
• Open source license (Apache 2.0)
• Built by 1000+ developers from 200+ companies
10. Apache Spark Engine
Spark Core, with standard libraries on top: Spark Streaming, Spark SQL, SparkML/MLlib, GraphFrames/GraphX
• Unified engine across diverse workloads and environments
• Scale-out and fault tolerant
• Python, Java, Scala, and R APIs
• Standard libraries
11. History of Spark APIs
RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)
DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and an optimizer
• Fast/efficient internal representations
Dataset (2015)
• Internally rows, externally JVM objects
• Almost the “best of both worlds”: type safe + fast
• But slower than DataFrames
• Not as good for interactive analysis, especially in Python
12. Apache Spark 2.0 API
Dataset (2016)
• DataFrame = Dataset[Row]
DataFrame (untyped API)
• Convenient for interactive analysis
• Faster
Dataset (typed API)
• Optimized for data engineering
• Fast
13. Benefit of the Logical Plan:
Performance Parity Across Languages
[Chart: DataFrame vs. RDD performance compared across languages]
15. Why do Machine Learning?
• Machine learning uses computers and algorithms to recognize patterns in data
• Businesses have to adapt to change faster
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
20. Why Spark ML
Provides general-purpose ML algorithms on top of Spark
• Lets Spark handle the distribution of data and queries, for scalability
• Leverages Spark’s improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
21. SparkML
ML Pipelines provide:
• Integration with DataFrames
• Familiar API based on scikit-learn
• Easy workflow inspection
• Simple parameter tuning
22. Databricks & SparkML
• Use DataFrames to directly access data (SQL, raw files)
• Extract, Transform and Load Data using an elastic cluster
• Create the model using all of the data
• Iterate many times on the model
• Deploy the same model to production using the same code
• Repeat
23. Advantages of Spark ML
• Data can be accessed directly using the Spark Data Sources API (no more endless hours copying data between systems)
• Data scientists can use all of the data rather than subsamples, taking advantage of the law of large numbers to improve model accuracy
• Data scientists can scale compute with the data size and model complexity
• Data scientists can iterate more, giving them the opportunity to create better models and to test and release more frequently
24. SparkML - Tips
• Understand Spark partitions
• Use the Parquet file format and compact files
• coalesce() / repartition()
• Leverage existing functions / UDFs
• Leverage DataFrames and SparkML
• Iterative algorithms: more cores for faster processing
26. Spark 2.0 - SparkML
• The RDD-based MLlib API is deprecated and in maintenance mode
• New algorithm support
• Bisecting k-means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer
• PySpark updates
• LDA, Gaussian Mixture Model, Generalized Linear Regression
• Model persistence across languages
28. Thanks!
Sign Up For Databricks Community Edition!
https://databricks.com/try-databricks
29. Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.
References
• Apache Spark MLlib User Guide
• The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)