LAYING THE FOUNDATION FOR IONIC PLATFORM INSIGHTS ON SPARK
Timothy Van Heest, Mi Yan, Robert Beatty
Ionic Security
Ionic Security PUBLIC

Outline
• Context
• Solution
• Lessons learned
• What's next
• Why Spark + Databricks
• Questions

Context: Who we are
• Mi Yan (mi@ionic.com)
• Robert Beatty (rbeatty@ionic.com)
• Timothy Van Heest (timothy@ionic.com)

Context: Our company
• Data security platform
• Core is a key management service (KMS)
• End-to-end secure access to encryption keys
• Real-time control over key access via granular and flexible policies based on context
• Allows users to manage data more granularly and with better visibility

Context: What we do for Ionic
• Visibility and insight for users of Ionic
• Help users get more value out of the Ionic platform
• Our main data source is a transaction log of all key create and key access events

Context: Our goals for this project
1. Phase out existing solution
2. Enable low-cost reporting
3. Remain flexible and agile
4. Lay foundation for advanced analytics

Context: Our goals for this project
1. Phase out existing solution
• Multi-TB Elasticsearch cluster
• Pros
o Easy to create new visualizations and reports (flexible queries)
o Real time
• Cons
o Expensive (all data indexed)
o Poor performance (esp. for complex aggregations)
o Not well suited to ML and advanced analytics

Context: Our goals for this project
2. Enable low-cost reporting
• Scheduled batch jobs
• Data storage on S3 only
• Fully ephemeral compute
• Cost of analytics should be comparable to cost of core Ionic services
• >50x reduction in total cost of running in production

Context: Our goals for this project
3. Flexible reporting based on use case
• The Ionic platform has many different uses
• Need to quickly build out domain-specific reports and analyses
• Turn new data sources into insights quickly

Context: Our goals for this project
4. Lay foundation for advanced analytics
• Want to provide deeper insights, predictions, and recommendations
• Anomaly detection, risk scoring, UEBA, community analysis to enhance security
• Usage recommendations, mining best practices for better user experience
• Many, many more...
Image from “Anomaly detection in online social networks”, Savage et al., 2014
Image from “Anomaly detection with Local Outlier Factor”, http://scikit-learn.org/

SOLUTION
The system we built using Spark + Databricks to enable low-cost, flexible reporting and lay a foundation for advanced analytics

Solution: Our core job
Diagram: log data (Protobuf encoded) on AWS S3 is processed by four stages (Parquet file converter, Intermediate dataset creator, Rollup dataset creator, Final dataset creator) into reports (csv and json) on AWS S3.
Query files: describe input data, operations on data, output

Solution: Our workflow: Dev
• Development process
o Sbt shell
» Run tests
» Test compilation
o Databricks notebook
» Develop new core functions
» Add new report queries
» Test against large amounts of data
o Per-user sandboxes
» Namespaced Databricks jobs
» Namespaced AWS S3 path
» In-house Python tool to create/update sandbox
Pipeline diagram: Development; Bitbucket PR (git); Jenkins (build, test, create artifact); Jenkins deploy (upload artifacts to AWS S3); Databricks job (daily); Test; Monitor

Solution: Our workflow: Test
• Test setup
o Areas of focus
» Spark UDF, UDAF, Aggregators
» Dataframe transformations, esp. report creators
» End-to-end test for some query types (local IO)
o Test fixtures
» Most tests use small data sets to cover edge cases
» All end-to-end tests use the same larger test fixture (~2 MB gzipped protocol buffers)
o Running tests
» From the sbt shell, run selected tests only (ScalaTest's -z matches tests whose name contains the given text): testOnly com.ionic.report.creator.EndToEndTest -- -z "categorical stats"
» Full tests are usually run just on Jenkins
» For larger changes, manually run the Databricks job to process data in a sandbox

Solution: Our workflow: CI
• Use Atlassian Bitbucket + Jenkins for Continuous Integration (CI)
• Jenkins build
o Automatically build on every change to a valid feature/bugfix branch
o Run full tests
o Create artifacts (config files, uber jar) and pack into tarball
o Upload tarball to release artifact repository
Screenshot: “Test Result” page for Jenkins build job

Solution: Our workflow: CI
• Jenkins deployment
o Auto-deploy main branch to development environment on change
o Steps
» Grab settings from AWS Parameter Store
» Download artifact + unpack
» Upload config / settings files to S3
» Create / update Databricks jobs
• Scheduled daily job on Databricks
o Runs latest stable version of code
Screenshot: “Build with Parameters” page for Jenkins deployment job

Solution: Our workflow: Monitor
• Monitoring and metering
o sparkMeasure
» Library to capture the Spark event log as a Dataframe for analysis
» Includes the same data as the Spark UI, but can be analyzed in a Databricks notebook
» Example usage: Dataframe caching in report creators at the final dataset step
o Additional custom per-query monitoring
» Registry of input and output files + data formats (used to generate documentation)
» Heap snapshot (just for the master node for now)
» Per-query timing + custom stats
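
The per-query timing mentioned on this slide could look roughly like the sketch below. It is a minimal illustration in plain Scala, not the presenters' code: `QueryTimer`, `Timing`, and the query names are hypothetical, and a real job would persist the results rather than keep them in memory.

```scala
import scala.collection.mutable

object QueryTimer {
  // One record per executed query (hypothetical shape).
  final case class Timing(query: String, millis: Long, succeeded: Boolean)

  private val timings = mutable.ArrayBuffer.empty[Timing]

  // Runs `body`, records wall-clock time under `query`, and lets failures propagate.
  def timed[A](query: String)(body: => A): A = {
    val start = System.nanoTime()
    var ok = false
    try {
      val result = body
      ok = true
      result
    } finally {
      timings += Timing(query, (System.nanoTime() - start) / 1000000L, ok)
    }
  }

  // Snapshot of everything recorded so far.
  def report: Seq[Timing] = timings.toList

  def main(args: Array[String]): Unit = {
    timed("rollup")(Thread.sleep(20))
    timed("final-dataset")((1 to 1000).sum)
    report.foreach(t => println(s"${t.query}: ${t.millis} ms, ok=${t.succeeded}"))
  }
}
```

Wrapping each query in one helper keeps timing out of the query logic, and the recorded rows can later be turned into a Dataframe for analysis in a notebook.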

LESSONS LEARNED
+ Rationale behind design choices

Lessons Learned: Scala
• Scala/Java Spark APIs get the latest and greatest features before Python/R
• Scala is deceptively easy at first and feels like a strongly typed Python, but it has a steep learning curve.
o Polymorphism, the Option[] type, sbt, all the syntactic sugar, etc. are going to trip you up.
o E.g., this is a thing in Scala: (example image not preserved)
• Ultimately, learning Scala was not a bad choice, but be ready for headaches and time dedicated solely to learning Scala.
o Also, don't be afraid to switch between Scala/Java and Python/R.
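
The example shown on this slide is not reproduced in the transcript. As a stand-in (not the presenters' snippet), the sketch below illustrates two of the features named above, the Option type and syntactic sugar. The lookup table and function names are made up for illustration.

```scala
object OptionSugar {
  // Hypothetical lookup table; Map.get returns an Option instead of null.
  val ownerByKey: Map[String, String] = Map("key-1" -> "alice", "key-2" -> "")

  // Explicit style: pattern match on the Option, with a guard for empty owners.
  def ownerExplicit(keyId: String): String =
    ownerByKey.get(keyId) match {
      case Some(owner) if owner.nonEmpty => owner
      case _                             => "unknown"
    }

  // Sugared style: the same logic using placeholder syntax (`_`) and combinators.
  def ownerSugared(keyId: String): String =
    ownerByKey.get(keyId).filter(_.nonEmpty).getOrElse("unknown")

  def main(args: Array[String]): Unit = {
    println(ownerExplicit("key-1")) // alice
    println(ownerSugared("key-2"))  // unknown (the empty owner is filtered out)
    println(ownerSugared("key-9"))  // unknown (the key is not present)
  }
}
```

Both functions behave identically; the second is shorter but harder to read until the combinators and the `_` shorthand are familiar, which is the kind of learning curve the slide describes.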

Lessons Learned: Uber jar
• Single job (uber jar) that manages the whole workflow from beginning to end.
o Built-in checkpointing per query type and time range allows for easy restart on failure, even with new code commits
o Easy to get your notebook workspace set up with the latest version of the code and to manage your sbt build files/jobs in general
o Avoids the complexity of something like Airflow and allows us to use the Databricks scheduler to manage job runs
Diagram label: Ionic Reporting Repo
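
One way to picture checkpointing per query type and time range is the plain-Scala sketch below. It is an assumption-laden illustration, not the presenters' implementation: `WorkUnit`, `CheckpointStore`, and `runAll` are hypothetical names, and the in-memory set stands in for whatever durable store the real job uses.

```scala
import java.time.LocalDate
import scala.collection.mutable

object CheckpointSketch {
  // One unit of work: a query type applied to one day of data.
  final case class WorkUnit(queryType: String, day: LocalDate)

  // Stand-in for a durable checkpoint store (for example, marker files).
  final class CheckpointStore {
    private val done = mutable.Set.empty[WorkUnit]
    def isDone(unit: WorkUnit): Boolean = done.contains(unit)
    def markDone(unit: WorkUnit): Unit = done += unit
  }

  // Runs only the units without a checkpoint and returns the units executed.
  def runAll(units: Seq[WorkUnit], store: CheckpointStore)(run: WorkUnit => Unit): Seq[WorkUnit] =
    units.filterNot(store.isDone).map { unit =>
      run(unit)            // if this throws, the unit stays unmarked and is retried next run
      store.markDone(unit)
      unit
    }

  def main(args: Array[String]): Unit = {
    val store = new CheckpointStore
    val day   = LocalDate.parse("2018-06-01")
    val units = Seq(WorkUnit("parquet", day), WorkUnit("rollup", day))
    println(runAll(units, store)(_ => ()).size) // first run executes both units
    println(runAll(units, store)(_ => ()).size) // a restart finds nothing left to do
  }
}
```

Because a unit is marked only after it succeeds, rerunning the same job after a failure (or after deploying new code) resumes from the first unfinished unit.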

Lessons Learned: Config Driven
• The job states and queries are all config driven
o Uses pureconfig (Scala), built on Typesafe Config (Java)
o Configs use HOCON (a JSON superset for config files), which pureconfig then parses into case classes. Has a slight learning curve.
o Our workflow was greatly improved by the addition of tests to validate query syntax, parameters, etc.
o Very flexible: can push new queries to prod without recompiling.
Example query: creates a bar chart of application/OS usage
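
The example query shown on the slide is not part of this transcript. The sketch below shows the general mechanism instead: HOCON text parsed by pureconfig into a case class, with a failed parse surfacing as a value that a test can assert on. The `QueryConfig` fields and the HOCON keys are hypothetical, and the code assumes Scala 2 with the `pureconfig` and `pureconfig-generic` modules (0.12 or later) on the classpath.

```scala
import pureconfig.ConfigSource
import pureconfig.error.ConfigReaderFailures
import pureconfig.generic.auto._

object QueryConfigSketch {
  // Hypothetical query shape; the real schema is not shown in the slides.
  final case class QueryConfig(name: String, input: String, groupBy: List[String], output: String)

  // HOCON text; pureconfig maps kebab-case keys to camelCase fields by default.
  val hocon: String =
    """
      |name = "app-os-usage"
      |input = "intermediate/key-events"
      |group-by = ["application", "os"]
      |output = "reports/app-os-usage"
      |""".stripMargin

  // Left carries readable failures (missing key, wrong type); Right carries the case class.
  def parse(text: String): Either[ConfigReaderFailures, QueryConfig] =
    ConfigSource.string(text).load[QueryConfig]

  def main(args: Array[String]): Unit = {
    println(parse(hocon))
    println(parse("""name = "incomplete"""").isLeft) // missing keys are reported, not ignored
  }
}
```

Since parsing returns an `Either`, a unit test can load every query file and fail the build on the first malformed one, which matches the slide's point about validating queries before they reach production.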

Lessons Learned: IO
• All IO operations go through the HDFS driver
o We originally used the AWS Java S3 API, but moved all of these calls to HDFS
o Simpler testing, more portable
• We process all of the tenants together in the same Dataframe. However, we want to save one csv and one JSON file per tenant.
o This was originally done with a for loop over tenant ids (uses one executor)
o Switched to “repartition” the data by tenant id, then call "partitionBy" on the writer with tenant id as the partition key (uses all executors)
o We also have a custom partitioner for large tenants that breaks the csv/JSON into multiple smaller files
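
The repartition-then-partitionBy pattern described above can be sketched with the standard Spark Dataframe API as follows. This is an illustration under assumptions: the column name `tenant_id`, the sample rows, and the output path are hypothetical, and the custom partitioner for large tenants is not shown.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object PerTenantWriter {
  // Shuffle so that each tenant's rows share one partition, then let the writer
  // create one output directory per tenant (tenant_id=<value>/part-*.csv).
  def writePerTenant(df: DataFrame, outputPath: String): Unit =
    df.repartition(col("tenant_id"))
      .write
      .partitionBy("tenant_id")
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .csv(outputPath)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("per-tenant").getOrCreate()
    import spark.implicits._
    val df = Seq(("t1", "create", 3L), ("t1", "access", 9L), ("t2", "access", 4L))
      .toDF("tenant_id", "event", "count")
    writePerTenant(df, args.headOption.getOrElse("/tmp/per-tenant-reports"))
    spark.stop()
  }
}
```

Without the `repartition`, every task could hold rows for every tenant and each would write its own small file into each tenant directory. Shuffling by tenant first means each tenant directory receives a single file, and the write is spread across all executors rather than driven by a loop.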

Lessons Learned: Idempotence
• Idempotence everywhere
o Had issues getting Spark read and managed tables to work the way we wanted for our use case
» E.g., managed tables required a full scan when updating data
o We created several helper functions to manage IO, which let us side-step almost all of those issues
» The helper functions decide where and what to read, so we don't require features like filter pushdown or partitionBy for most IO
o Extensive tests for the parts of the code responsible for IO to ensure repeatability
o Performance so far has been better than in the tests we did with other options
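
A helper that "decides where and what to read" might resemble the sketch below. The directory layout (`<base>/<dataset>/date=YYYY-MM-DD`), the bucket name, and the function names are assumptions for illustration; the slides do not show the actual helpers.

```scala
import java.time.LocalDate

object ReadPlanner {
  // Assumed layout (hypothetical): <base>/<dataset>/date=YYYY-MM-DD
  def dayPath(base: String, dataset: String, day: LocalDate): String =
    s"$base/$dataset/date=$day"

  // Enumerates exactly the directories a job needs for [start, end], so the
  // reader never scans the whole dataset or relies on filter pushdown.
  def pathsFor(base: String, dataset: String, start: LocalDate, end: LocalDate): Seq[String] =
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(day => !day.isAfter(end))
      .map(day => dayPath(base, dataset, day))
      .toList

  def main(args: Array[String]): Unit = {
    val paths = pathsFor("s3a://example-bucket", "rollup",
      LocalDate.parse("2018-06-01"), LocalDate.parse("2018-06-03"))
    paths.foreach(println)
  }
}
```

Because the same inputs always map to the same paths, a rerun for a given day reads the same directories and can overwrite the same output location, which is what makes a restart repeatable.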

Lessons Learned: SQL vs. Dataframes
• Originally, queries were mostly in SQL when we were prototyping
o Fast for developing queries without having a complex code base behind them
• We have since shifted most non-trivial queries to the Dataframe API
o Functional programming works great for large, complex queries (e.g. foldLeft)
o Easier to reason about, much cleaner, and often simpler to code
o Faster on average: code should perform the same after the Catalyst optimizer does its magic, but writing good code in SQL is harder than in Scala
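
As an illustration of the foldLeft style mentioned above, the sketch below adds one indicator column per event type by threading a Dataframe through `foldLeft`. The column names and event types are hypothetical; only standard Spark SQL functions are used.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, when}

object FoldLeftColumns {
  // Adds one 0/1 indicator column per event type, threading the Dataframe
  // through foldLeft instead of assembling a long SQL string.
  def withEventFlags(df: DataFrame, eventTypes: Seq[String]): DataFrame =
    eventTypes.foldLeft(df) { (acc, eventType) =>
      acc.withColumn(s"is_$eventType", when(col("event") === eventType, 1).otherwise(0))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("fold-left").getOrCreate()
    import spark.implicits._
    val events = Seq(("t1", "create"), ("t1", "access"), ("t2", "access"))
      .toDF("tenant_id", "event")
    withEventFlags(events, Seq("create", "access"))
      .groupBy("tenant_id")
      .agg(sum("is_create").as("creates"), sum("is_access").as("accesses"))
      .show()
    spark.stop()
  }
}
```

The list of event types can come from configuration, so supporting a new type means adding a list entry rather than editing query text.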

Lessons Learned: User Functions
• Aggregator, UDAF, and UDF + collect_list can each be used to aggregate several rows into a single row
o UDF + collect_list for well-bounded data (few aggregated rows); super easy to write, fast (generally)
o Aggregator for unbounded data; easier to write than UDAFs, but not a first-class API and not SQL compliant
o UDAF for unbounded data; future-proofed and usable anywhere in Spark
• Complex logic pushed into Aggregators, Transformers, custom columns, etc.
o Easier to test, use, and reason about in code (i.e., separation of concerns)
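
For reference, a typed Aggregator has the shape sketched below. This is a generic example built on Spark's `org.apache.spark.sql.expressions.Aggregator`, not one of the presenters' aggregators; the `Event` and `AvgBuffer` case classes and the averaging logic are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

object AggregatorSketch {
  // Hypothetical input row and constant-size aggregation buffer.
  final case class Event(tenantId: String, keyCount: Long)
  final case class AvgBuffer(sum: Long, count: Long)

  // The buffer never grows with the group size, so unbounded groups are safe.
  object AvgKeys extends Aggregator[Event, AvgBuffer, Double] {
    def zero: AvgBuffer = AvgBuffer(0L, 0L)
    def reduce(b: AvgBuffer, e: Event): AvgBuffer = AvgBuffer(b.sum + e.keyCount, b.count + 1)
    def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer = AvgBuffer(a.sum + b.sum, a.count + b.count)
    def finish(b: AvgBuffer): Double = if (b.count == 0) 0.0 else b.sum.toDouble / b.count
    def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
    def outputEncoder: Encoder[Double] = Encoders.scalaDouble
  }

  // Average key count per tenant, collected to the driver for inspection.
  def averages(events: Dataset[Event]): Map[String, Double] =
    events.groupByKey((e: Event) => e.tenantId)(Encoders.STRING)
      .agg(AvgKeys.toColumn)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("aggregator").getOrCreate()
    import spark.implicits._
    println(averages(Seq(Event("t1", 2), Event("t1", 4), Event("t2", 10)).toDS()))
    spark.stop()
  }
}
```

The `reduce` and `merge` functions are ordinary Scala, so they can be unit tested without a Spark session, which fits the separation-of-concerns point above.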

Lessons Learned: Sketches
• Sketches (approximate algorithms) are awesome
o We specifically use TDigest (percentile estimation) and HyperLogLog (cardinality estimation)
o Used by several organizations, with Twitter and Yahoo providing excellent open source implementations
» We use Algebird's HyperLogLog implementation, and our TDigest is just an updated version of what they use (QDigest)
o Creates compact, portable, fast, and mergeable objects with numerous useful methods
o Embraces uncertainty and incremental calculations
Image labels: Twitter Algebird; Approximate Algorithms
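
The mergeability of sketches can be seen in a small HyperLogLog example using Algebird's `HyperLogLogMonoid`. This is a sketch under assumptions: it requires the `algebird-core` library on the classpath, the choice of 12 bits and the user ids are arbitrary, and it is not taken from the presenters' code.

```scala
import com.twitter.algebird.{HLL, HyperLogLogMonoid}

object HllSketch {
  // 12 bits gives 4096 registers; more bits trade memory for lower error.
  val hll = new HyperLogLogMonoid(bits = 12)

  // Builds one sketch from a batch of ids.
  def sketch(ids: Iterable[String]): HLL =
    hll.sum(ids.map(id => hll.create(id.getBytes("UTF-8"))))

  def main(args: Array[String]): Unit = {
    val monday  = sketch((1 to 1000).map(i => s"user-$i"))
    val tuesday = sketch((500 to 1500).map(i => s"user-$i"))
    // Sketches merge without revisiting raw data; overlapping ids are not double counted.
    val both = hll.plus(monday, tuesday)
    println(both.approximateSize.estimate) // an estimate near 1500, not an exact count
  }
}
```

Each day's sketch can be stored and merged later for weekly or monthly distinct counts, which is the incremental calculation the slide refers to.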

What's next
• Real-time via Spark Streaming (Kinesis + Redis)
• More graph analysis (user-user, key-key)
• More anomaly detection
• More job performance analysis
• Integration with core platform (UI)
• More data streams (additional data types)

Why Spark?
• Good general-purpose tool
o Batch and streaming
o Data engineering and ML
o Ad-hoc and production-grade
• Allows us to write much of our code with less regard for scale
• Dataframe / SQL APIs are very productive
• Huge user base means somebody else has probably run into your problem already (and can help!)

Why Databricks?
• Easy-mode operations
• Fast time to spin up and evaluate
• Ephemeral compute => fewer compliance issues
• Best notebook interface
• Excellent support for Spark
• Team has been very responsive
• Quickly improving platform
Roles played by Databricks in our development pipeline:
• Dev: Notebook prototypes
• CI: Integration tests
• Prod: Scheduled jobs
• Ops: Performance analysis

Summary
• Our goal was a low-cost, high-value, flexible platform to use as a foundation for our data analytics capabilities
• Spark + Databricks has been a great fit
• Our team is well positioned to quickly build out new high-value functionality
• If your needs are similar, Spark + Databricks is worth considering

QUESTIONS?
Thank you for your time and attention.
