Transitioning from Java to Scala for Spark - March 13, 2019

Gravy Analytics
1| gravyanalytics.com
Transitioning from Java to Scala for Spark
Guy DeCorte, Founder & CTO
Aaron Perrin, Senior Software Developer
March 13, 2019

Where we go is who we are.
REAL-WORLD CONSUMER BEHAVIOR
LIFE STAGES
LIFESTYLES
AFFINITIES
INTERESTS
The events consumers attend, the places they visit, and where they spend their time translate into intelligence.

INDUSTRY-LEADING CAPABILITIES
We translate the locations that consumers visit, the places they go, and the events they attend into real-world consumer intelligence.

GRAVY SOLUTIONS
• GRAVY AUDIENCES: Gravy Audiences let marketers reach engaged consumers based on what they do in real life
  • Lifestyle
  • Enthusiast
  • In-Market
  • Branded
  • Custom
• GRAVY INSIGHTS: Gravy Insights provides brands with in-depth customer and competitive intelligence
  • Foot Traffic
  • Competitive
  • Attribution
• GRAVY DAAS: AdmitOne™ verified Visitation, Attendance, Event data and more for use in unique business applications
  • Visitations
  • Attendances
  • IP Address
  • User Agent

THE GRAVY DIFFERENCE
Gravy’s patented AdmitOne verification engine delivers the highest-quality location and attendance data in the industry.
• REACH: Billions of daily location signals from 250M+ mobile devices
• EVENTS: The largest events database gives context to millions of places and POIs
• VERIFIED: Confirmed, deterministic consumer attendances at places and events

SOLUTION
[Diagram: Geo-Signals Cloud pipeline. Stages: Distribute, Filter & Verify, Merge, Spatial Index, LCO & Attendance Algorithm, Persona Generator, Device Processing. Outputs: Attendances, Detail Records, Personas / Audiences, Devices. Storage and tools: Datasets in S3, Snowflake, Matillion, Zeppelin/EMR, SQL, R, Excel, Dashboards (Sisense). Callout: "Lots of Spark jobs!"]

SUMMARY OF SPARK JOBS
Some of the major Spark jobs that we run:
• Ingest
  • Also validates, removes, and/or flags data based on LDVS output
• Location and Device Verification Service (LDVS)
• Signal Merge / Device Merge
• Persona Generator
• Spatial Indexer
What's Our Platform Look Like?

THE CORE PLATFORM
• Environment
  • We currently run ~30 Spark jobs daily
  • On average, per hour: ~1300 cores and ~10 TiB memory
  • AWS EMR (and spot instances to control costs)
  • Data storage: S3 and Snowflake
• The Code (Platform)
  • ~200k lines of Java, ~30k lines of Scala
  • Strong domain-driven-design influence
  • Many jobs can be run in Spark or stand-alone
• Central orchestration application
  • Custom DAG scheduler
  • Responsible for job scheduling, configuring, launching, monitoring, and failure recovery
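
The deck does not show how the custom DAG scheduler works. As a rough illustration of the core idea only (ordering jobs so each one runs after the jobs it depends on), here is a minimal sketch using Kahn's algorithm. The object name `DagOrder`, the function `topoOrder`, and any dependency edges passed to it are assumptions made for this example, not Gravy's implementation.

```scala
import scala.collection.mutable

object DagOrder {
  /** Orders jobs so that every job comes after its dependencies (Kahn's algorithm).
    * `deps` maps a job name to the names of the jobs it depends on.
    * Returns Left if the graph contains a cycle. Hypothetical helper, for illustration.
    */
  def topoOrder(deps: Map[String, Set[String]]): Either[String, List[String]] = {
    // Every job mentioned, whether as a key or only as a dependency.
    val jobs: Set[String] = deps.keySet ++ deps.values.flatten

    // Number of dependencies each job is still waiting on.
    val pending = mutable.Map(jobs.toSeq.map(j => j -> deps.getOrElse(j, Set.empty[String]).size): _*)

    // Reverse edges: for each job, the jobs that wait on it.
    val dependents: Map[String, Set[String]] =
      jobs.map(j => j -> jobs.filter(k => deps.getOrElse(k, Set.empty[String]).contains(j))).toMap

    // Jobs with no dependencies can start immediately (sorted for a stable order).
    val ready = mutable.Queue(jobs.filter(j => pending(j) == 0).toSeq.sorted: _*)
    val order = mutable.ListBuffer.empty[String]

    while (ready.nonEmpty) {
      val job = ready.dequeue()
      order += job
      for (d <- dependents(job).toSeq.sorted) {
        pending(d) -= 1
        if (pending(d) == 0) ready.enqueue(d)
      }
    }

    if (order.size == jobs.size) Right(order.toList)
    else Left("cycle detected among jobs")
  }
}
```

A real scheduler would also have to cover the configuring, launching, monitoring, and failure-recovery duties listed on the slide; this sketch covers ordering only.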

SOFTWARE ARCHITECTURE EVOLUTION
• 2015-2016
  • Targets: 25M sources, 450M events per day (5,500/sec)
  • Java - Microservices, DDD, AWS (Kinesis/SQS/EC2/DynamoDB/Redshift/etc.)
• 2016-2017
  • Targets: 100M sources, 4B events per day (40,000/sec)
  • Java - Hybrid: Spark 1.6 / Microservices (experiments with storage)
• 2017-2018
  • Targets: 200M sources, 10B events per day (100,000/sec)
  • Java - Spark 2.0 / DynamoDB / S3 / Snowflake
• 2018-2019+
  • Targets: 400M+ sources, 25B+ events per day (300,000/sec)
  • Scala - Spark 2.4 / DynamoDB / S3 / Snowflake
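
The per-second targets on this slide are roughly the daily volumes divided by the 86,400 seconds in a day, then rounded. The deck does not say whether the rounding reflects averages or headroom for peaks, so the sketch below only computes the plain daily average for comparison. `Throughput` and `avgPerSecond` are names invented for this example.

```scala
object Throughput {
  private val SecondsPerDay: Long = 24L * 60 * 60 // 86,400

  /** Average events per second for a daily event volume, rounded down. */
  def avgPerSecond(eventsPerDay: Long): Long = eventsPerDay / SecondsPerDay

  def main(args: Array[String]): Unit = {
    // Daily targets as stated on the slide.
    val targets = Seq(
      "2015-2016"  -> 450000000L,
      "2016-2017"  -> 4000000000L,
      "2017-2018"  -> 10000000000L,
      "2018-2019+" -> 25000000000L
    )
    targets.foreach { case (period, perDay) =>
      println(s"$period: ${avgPerSecond(perDay)} events/sec on average")
    }
  }
}
```

The computed averages (5,208, 46,296, 115,740, and 289,351 events per second) land in the same range as the slide's rounded figures.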

FROM RDDs TO DATASETS AND MORE
• We started using Spark before Datasets were a thing
• The original Spark code was designed around RDDs
• As data scaled, we targeted (easy) ways to improve efficiency
• From Spark 2.0 onward, Datasets became more attractive
• What we did
  • Reduced size of domain types to reduce memory overhead
  • Refactored monolithic Spark jobs into specialized jobs
  • Migrated JSON data to Parquet (with partitions)
  • Transitioned from RDD API to Dataset API

WHY DATASETS?
• Transformations, aggregations, and filters are easier with Datasets
• Improved Dataset performance from Spark 2.0 onward
• Datasets provide an abstraction layer enabling optimized execution plans
• Easier, more fluent interface
• Datasets provide columnar optimization to improve data and shuffling performance
• Enhanced functionality with functions._ (org.apache.spark.sql.functions)
• Support for SQL, when necessary
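
To make these points concrete, here is a small sketch of a typed filter followed by an aggregation built from `org.apache.spark.sql.functions`. It assumes Spark SQL is on the classpath (the deck mentions Spark 2.4). The `Signal` and `PlaceVisits` case classes and their fields are hypothetical and do not reflect Gravy's schema.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, count, countDistinct, lit}

// Hypothetical record types, for illustration only.
case class Signal(deviceId: String, placeId: String, epochSeconds: Long)
case class PlaceVisits(placeId: String, signals: Long, devices: Long)

object DatasetSketch {
  /** Counts signals and distinct devices per place, ignoring signals without a timestamp. */
  def visitsPerPlace(signals: Dataset[Signal]): Dataset[PlaceVisits] = {
    import signals.sparkSession.implicits._ // encoders for the case classes
    signals
      .filter(_.epochSeconds > 0)                 // typed filter on the case class
      .groupBy(col("placeId"))                    // relational grouping
      .agg(
        count(lit(1)).as("signals"),              // rows per place
        countDistinct(col("deviceId")).as("devices")
      )
      .as[PlaceVisits]                            // back to a typed Dataset
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
    import spark.implicits._
    val sample = Seq(
      Signal("d1", "p1", 10L),
      Signal("d2", "p1", 11L),
      Signal("d1", "p2", 12L)
    ).toDS()
    visitsPerPlace(sample).show()
    spark.stop()
  }
}
```

Because the grouping and aggregation are expressed as column expressions, Spark can plan and optimize them, which is the "abstraction layer" benefit the slide refers to.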

WHY SCALA?
• The Dataset API is available in Java, so why did we switch?
  • Understanding Spark internals or modifying its functionality was difficult without knowing Scala
  • Scala is a cleanly designed language
  • We wanted to avoid the (often cumbersome) Java API
• Our initial experiments with Scala proved its ease of use
  • Case classes resulted in easier serialization and better shuffling performance
  • Immutable types provided better garbage-collection behavior
  • Use of the Spark REPL enabled faster prototyping
• Scala's tools and libraries have matured significantly
  • Lots of best practices available
• Understanding Scala gives the team a deeper understanding of the underlying Spark code
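
As a small illustration of the case-class and immutability points, the sketch below shows a one-line domain type with structural equality and a non-mutating update. `Attendance` and its fields are hypothetical names chosen for this example. Case classes also implement `Product`, which is the property Spark's built-in encoders for case classes rely on.

```scala
object CaseClassSketch {
  // Hypothetical domain type; one line replaces a Java bean with
  // constructor, getters, equals, hashCode, and toString.
  final case class Attendance(deviceId: String, placeId: String, minutes: Int)

  /** Returns an updated copy; the original value is never mutated. */
  def extend(a: Attendance, extraMinutes: Int): Attendance =
    a.copy(minutes = a.minutes + extraMinutes)

  /** Field values in declaration order, available because case classes are Products. */
  def fields(a: Attendance): List[Any] = a.productIterator.toList
}
```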

CHALLENGES: SCALA
• The switch was worth it, but it wasn't without a cost
1. Lack of Experience
  • Initially we had only one developer with Scala experience
2. Large Amounts of Legacy Java Code
  • We have taken a staged approach, but it is still a large effort
3. Shift in Coding Mentality
  • Embracing a more functional coding style requires changing how we think about problems
AN EXAMPLE: JAVA RDD
AN EXAMPLE: SCALA DATASET

UNIT TESTING
• Transitioning from JUnit to ScalaTest
• Lack of Experience
  • Another scenario where the development team needed to ramp up on new technology
• DataMapper
  • We have a homegrown library called the DataMapper, which allows us to generate test data at runtime from annotations on our unit tests
  • The Java version of this library relied on reflection and did not play nicely with case classes
  • Eventually we produced a Scala / ScalaTest-compatible, trait-based version

HIRING/GOING FORWARD
• To drive home the fact that we are no longer a Java-only shop, we have modified our job listings to include Scala as a preferred language.
• It was challenging at first to evaluate candidates' Scala skills, as we were novices ourselves.
• As we continue to ramp up on Scala, we have started to branch out from using it only for Spark to using it for web services (Play Framework) as well as to replace some of our legacy utility libraries.
• We think we are now better positioned to quickly take advantage of newer features coming down the Spark pipeline.
DISCUSSION
QUESTIONS?

Extra: Scala Likes
• Greatly streamlined syntax
• Easier use with Spark
• Easy, fast serialization of case classes during shuffles
• Built-in Product type encoders
• Built-in tuple types
• Built-in anonymous functions
• Options instead of nulls
• Pattern matching instead of switch statements
• IntelliJ Scala support
• Simpler Futures
• “Duck-typing”
• Advanced reflection
• Functional exception handling
• Syntactic sugar
• Lots of helpers: Option, Try, Success, Failure, Either, etc.
• Everything is a function => more flexibility
• Easier generics (less type erasure)
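
Several of these items (Options instead of nulls, pattern matching, functional exception handling with Try) can be shown together using only the standard library. The function names and the latitude-parsing scenario below are invented for this sketch.

```scala
import scala.util.{Failure, Success, Try}

object LikesSketch {
  /** Parses a latitude. A malformed or out-of-range value becomes a Failure
    * instead of a thrown exception or a null.
    */
  def parseLat(raw: String): Try[Double] =
    Try(raw.trim.toDouble).filter(d => d >= -90.0 && d <= 90.0)

  /** Option replaces a nullable return value: None means "no valid latitude found". */
  def firstValid(raws: Seq[String]): Option[Double] =
    raws.iterator.map(parseLat).collectFirst { case Success(lat) => lat }

  /** Pattern matching with guards, where Java would use a switch or an if/else chain. */
  def describe(result: Try[Double]): String = result match {
    case Success(lat) if lat >= 0 => s"north $lat"
    case Success(lat)             => s"south ${-lat}"
    case Failure(e)               => s"invalid: ${e.getClass.getSimpleName}"
  }
}
```

For example, `LikesSketch.firstValid(Seq("x", "200", "38.9"))` skips the unparseable and out-of-range entries and yields `Some(38.9)`.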

Extra: Scala Dislikes
• Untyped vals
• Lots of special symbols
• Library complexity
  • Akka and Typesafe libraries
  • JSON parsing libraries (incompatibility with Gson, complex Scala libs)
• Java compatibility
  • Companion object wrapping
  • Bean serialization
• Default to Seq for ordered collections (instead of ideal data structure for the job)
• Gradle vs. SBT
• Overuse of implicit “magic”
• Difficult learning curve (lots to learn!!)
• Too much flexibility can create inconsistent and confusing code
• Opaque compilation errors
• Missing named tuples (as in Python)
• Enumerations are broken
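
On the enumeration complaint: a common workaround in Scala 2 is to model a closed set of values as a sealed trait with case objects, so the compiler can check that pattern matches are exhaustive. The deck does not say which approach the team adopted. `JobState` and its members are hypothetical names for this sketch.

```scala
object EnumSketch {
  // A closed set of values: all subtypes must be declared in this file.
  sealed trait JobState
  case object Pending extends JobState
  case object Running extends JobState
  case object Failed  extends JobState

  // Listed by hand; unlike scala.Enumeration there is no built-in `values`.
  val all: Seq[JobState] = Seq(Pending, Running, Failed)

  /** The compiler warns if a case is missing from this match. */
  def isTerminal(state: JobState): Boolean = state match {
    case Failed            => true
    case Pending | Running => false
  }
}
```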

Extra: Scala Challenges
• Immutable types instead of mutable types
• Collection syntax sugar
• Chaining functions causes lots of type headaches
• Syntactic sugar
• Using recursion (with @tailrec) instead of procedural
• Pattern matching
• Using small functions to keep code readable
• Reflection, type tags, and class tags
• Curried functions
• Partial functions
• Unfamiliar type system
• OO paradigms don’t translate well (have to research correct way of doing things)
• Lots to learn!!
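
Three of the constructs named here are easier to discuss with code in front of you: tail recursion checked by @tailrec, a curried function, and a partial function. The sketch below uses only the standard library; all names are invented for illustration.

```scala
import scala.annotation.tailrec

object ChallengesSketch {
  /** Sums a list with an accumulator instead of a mutable loop variable.
    * @tailrec makes the compiler reject the method if the recursive call
    * is not in tail position.
    */
  def sum(xs: List[Long]): Long = {
    @tailrec
    def loop(rest: List[Long], acc: Long): Long = rest match {
      case Nil          => acc
      case head :: tail => loop(tail, acc + head)
    }
    loop(xs, 0L)
  }

  /** Curried function: supply the threshold first to get a reusable predicate. */
  def atLeast(min: Int)(value: Int): Boolean = value >= min

  /** Partial function: defined only for even inputs. */
  val halveEven: PartialFunction[Int, Int] = {
    case n if n % 2 == 0 => n / 2
  }
}
```

For instance, `List(5, 10, 15).filter(ChallengesSketch.atLeast(10))` keeps 10 and 15, and `List(1, 2, 3, 4).collect(ChallengesSketch.halveEven)` keeps only the halves of the even numbers.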
Aaron Perrin, Senior Software Developer
703-840-8850
aperrin@gravyanalytics.com