
Transitioning from Java to Scala for Spark - March 13, 2019

Gravy Analytics ingests ~17 billion records of data daily and improves and refines that data into many data products at various levels of aggregation. To meet the challenges of our product requirements and scale, we constantly evaluate new technologies. Spark has become central to our ability to process ever-increasing amounts of data through our data factory. In late 2017 and throughout 2018, we improved our ability to work with Spark by migrating all Spark jobs to Scala. In this discussion, we'll cover areas that were more difficult to develop in Java than in Scala from a Spark perspective, as well as some of the challenges we met along the way.


  1. Transitioning from Java to Scala for Spark. Guy DeCorte, Founder & CTO; Aaron Perrin, Senior Software Developer. March 13, 2019
  2. Where we go is who we are. REAL-WORLD CONSUMER BEHAVIOR: life stages, lifestyles, affinities, interests. The events consumers attend, the places they visit, and where they spend their time translate into intelligence.
  3. INDUSTRY-LEADING CAPABILITIES: We translate the locations consumers visit, the places they go, and the events they attend into real-world consumer intelligence.
  4. GRAVY SOLUTIONS
     • Gravy DaaS: AdmitOne™ verified visitation, attendance, event data, and more for use in unique business applications (visitations, attendances, IP address, user agent)
     • Gravy Insights: provides brands with in-depth customer and competitive intelligence (foot traffic, competitive, attribution)
     • Gravy Audiences: lets marketers reach engaged consumers based on what they do in real life (lifestyle, enthusiast, in-market, branded, custom)
  5. THE GRAVY DIFFERENCE: Gravy's patented AdmitOne verification engine delivers the highest-quality location and attendance data in the industry.
     • REACH: billions of daily location signals from 250M+ mobile devices
     • EVENTS: the largest events database gives context to millions of places and POIs
     • VERIFIED: confirmed, deterministic consumer attendances at places and events
  6. SOLUTION (pipeline diagram): device processing over geo-signals in the cloud: distribute → filter & verify → merge → spatial index → LCO & attendance algorithm → persona generator, producing attendances, detail records, and personas/audiences. Lots of Spark jobs! Storage and analytics: datasets in S3, Snowflake, Matillion, Zeppelin/EMR, Snowflake SQL, R, Excel, and Sisense dashboards.
  7. SUMMARY OF SPARK JOBS. Some of the major Spark jobs that we run:
     • Ingest (also validates, removes, and/or flags data based on LDVS output)
     • Location and Device Verification Service (LDVS)
     • Signal Merge / Device Merge
     • Persona Generator
     • Spatial Indexer
  8. What's Our Platform Look Like?
  9. THE CORE PLATFORM
     • Environment
       • We currently run ~30 Spark jobs daily
       • On average, per hour: ~1,300 cores and ~10 TiB of memory
       • AWS EMR (with spot instances to control costs)
       • Data storage: S3 and Snowflake
     • The Code (Platform)
       • ~200k lines of Java, ~30k lines of Scala
       • Strong domain-driven-design influence
       • Many jobs can run either in Spark or stand-alone (sketched below)
     • Central orchestration application
       • Custom DAG scheduler
       • Responsible for job scheduling, configuration, launching, monitoring, and failure recovery
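
A hypothetical sketch of that "Spark or stand-alone" idea. The PlatformJob trait and the CountJob/Launcher names are illustrative, not the platform's actual API; they only show the general shape of a job that can run against a SparkSession or as a plain JVM process:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative job abstraction (hypothetical, not Gravy's real API)
trait PlatformJob {
  def name: String
  def run(spark: Option[SparkSession]): Unit
}

object CountJob extends PlatformJob {
  val name = "count-job"
  def run(spark: Option[SparkSession]): Unit = spark match {
    case Some(session) =>
      // Distributed path: let Spark do the counting
      println(s"$name on Spark: " + session.range(1000000L).count())
    case None =>
      // Stand-alone path: small inputs or local debugging
      println(s"$name stand-alone: " + (0L until 1000000L).length)
  }
}

object Launcher extends App {
  CountJob.run(None) // an orchestrator would decide which mode to use
}
```
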
  10. SOFTWARE ARCHITECTURE EVOLUTION
      • 2015-2016: targets of 25M sources, 450M events/day (5,500/sec)
        • Java: microservices, DDD, AWS (Kinesis/SQS/EC2/DynamoDB/Redshift/etc.)
      • 2016-2017: targets of 100M sources, 4B events/day (40,000/sec)
        • Java: hybrid of Spark 1.6 and microservices (experiments with storage)
      • 2017-2018: targets of 200M sources, 10B events/day (100,000/sec)
        • Java: Spark 2.0 / DynamoDB / S3 / Snowflake
      • 2018-2019+: targets of 400M+ sources, 25B+ events/day (300,000/sec)
        • Scala: Spark 2.4 / DynamoDB / S3 / Snowflake
  11. FROM RDDs TO DATASETS AND MORE
      • We started using Spark before Datasets were a thing
        • The original Spark code was designed around RDDs
        • As data scaled, we targeted (easy) ways to improve efficiency
        • After Spark 2.0+, Datasets became more attractive
      • What we did (sketched below)
        • Reduced the size of domain types to reduce memory overhead
        • Refactored monolithic Spark jobs into specialized jobs
        • Migrated JSON data to Parquet (with partitions)
        • Transitioned from the RDD API to the Dataset API
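
A minimal sketch of two of those steps: reading JSON into a typed Dataset and persisting it as partitioned Parquet. The bucket paths and the Signal type are illustrative, not the actual Gravy schema:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative domain type; not the real schema
case class Signal(deviceId: String, lat: Double, lon: Double, day: String)

object MigrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("migration-sketch").getOrCreate()
    import spark.implicits._

    // Before: an untyped RDD pipeline over raw JSON text, e.g.
    //   spark.sparkContext.textFile("s3://bucket/signals.json")

    // After: read JSON once into a typed Dataset...
    val signals: Dataset[Signal] =
      spark.read.json("s3://bucket/signals.json").as[Signal]

    // ...and persist as partitioned Parquet so downstream daily jobs
    // benefit from columnar storage and partition pruning
    signals.write
      .partitionBy("day")
      .parquet("s3://bucket/signals-parquet")

    spark.stop()
  }
}
```
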
  12. WHY DATASETS?
      • Transformations, aggregations, and filters are easier with Datasets
      • Dataset performance improved from Spark 2.0 onward
      • Datasets provide an abstraction layer that enables optimized execution plans
      • Easier, more fluent interface
      • Datasets provide columnar optimizations that improve data and shuffle performance
      • Enhanced functionality via functions._ (example below)
      • Support for SQL, when necessary
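
An illustrative example of those points: a fluent Dataset aggregation using functions._, plus the same question asked in SQL. The Visit type and its values are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical domain type for illustration
case class Visit(deviceId: String, placeId: String, dwellMinutes: Long)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dataset-sketch").getOrCreate()
    import spark.implicits._

    val visits = Seq(
      Visit("d1", "p1", 12), Visit("d1", "p2", 45), Visit("d2", "p1", 7)
    ).toDS()

    // Fluent, optimizer-friendly transformations and aggregations
    val dwellByPlace = visits
      .filter($"dwellMinutes" > 10)
      .groupBy($"placeId")
      .agg(avg($"dwellMinutes").as("avgDwell"), count("*").as("visits"))

    dwellByPlace.show()

    // The same data queried with SQL, when that's clearer
    visits.createOrReplaceTempView("visits")
    spark.sql("SELECT placeId, COUNT(*) FROM visits GROUP BY placeId").show()

    spark.stop()
  }
}
```
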
  13. WHY SCALA?
      • The Dataset API is available in Java, so why did we switch?
        • Understanding Spark internals or modifying its functionality was difficult without knowing Scala
        • Scala is a cleanly designed language
        • We wanted to avoid the (often cumbersome) Java API
      • Our initial experiments with Scala proved its ease of use
        • Case classes resulted in easier serialization and better shuffling performance (sketched below)
        • Immutable types provided better garbage collection
        • Use of the Spark REPL enabled faster prototyping
      • Scala's tools and libraries have matured significantly
        • Lots of best practices available
      • Understanding Scala gives the team a deeper understanding of the underlying Spark code
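
A sketch of the case-class point: a one-line case class gets a Product encoder for free via spark.implicits._, whereas the Java equivalent needs a full bean (getters, setters, equals/hashCode) plus an explicit Encoders.bean(...) call. The Device type here is hypothetical:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// One line replaces a multi-page Java bean; illustrative type only
case class Device(id: String, os: String, lastSeenEpoch: Long)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("encoder-sketch").getOrCreate()
    import spark.implicits._ // brings Product encoders into scope

    val devices: Dataset[Device] = Seq(
      Device("d1", "ios", 1552435200L),
      Device("d2", "android", 1552438800L)
    ).toDS()

    // Typed operations keep the case class all the way through
    devices.filter(_.os == "ios").show()
    spark.stop()
  }
}
```
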
  14. CHALLENGES: SCALA. The switch was worth it, but it wasn't without cost:
      1. Lack of experience: initially we had only one developer with Scala experience
      2. Large amounts of legacy Java code: we have taken a staged approach, but it is still a large effort
      3. Shift in coding mentality: embracing a more functional coding style requires changing how we think about problems
  15. AN EXAMPLE: JAVA RDD
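
The code for this slide (and its Scala counterpart on the following slide, which is also missing from this transcript) was an image and is not reproduced here. As a stand-in, here is an illustrative sketch, in Scala, of the kind of contrast the talk draws: an RDD-style aggregation next to the same logic in the Dataset API. The Event type is made up:

```scala
import org.apache.spark.sql.SparkSession

case class Event(deviceId: String, place: String)

object RddVsDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-vs-dataset").getOrCreate()
    import spark.implicits._

    val events = Seq(Event("d1", "p1"), Event("d1", "p2"), Event("d2", "p1"))

    // RDD style: manual pairing and reduction, opaque to the optimizer
    val byDeviceRdd = spark.sparkContext
      .parallelize(events)
      .map(e => (e.deviceId, 1L))
      .reduceByKey(_ + _)

    // Dataset style: declarative, optimizable, columnar under the hood
    val byDeviceDs = events.toDS().groupBy($"deviceId").count()

    byDeviceRdd.collect().foreach(println)
    byDeviceDs.show()

    spark.stop()
  }
}
```
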
  17. UNIT TESTING
      • Transitioning from JUnit to ScalaTest
        • Lack of experience: another scenario where the development team needed to ramp up on a new technology
      • DataMapper
        • A homegrown library that lets us generate test data at runtime from annotations on our unit tests
        • The Java version of this library relied on reflection and did not play nicely with case classes
        • Eventually we produced a Scala/ScalaTest-compatible, trait-based version (sketched below)
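
DataMapper itself is homegrown and its API isn't shown in the talk, so the following is only a guess at the general shape of a trait-based test-data mixin in ScalaTest. Every name here is hypothetical; the point is that a mixed-in trait replaces the reflection/annotation machinery that broke on case classes:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical domain type for the test
case class TestSignal(deviceId: String, lat: Double, lon: Double)

// Hypothetical trait-based generator: mixed into a suite instead of
// being driven by annotations and reflection
trait TestData {
  private var counter = 0
  def nextSignal(): TestSignal = {
    counter += 1
    TestSignal(s"device-$counter", 38.0 + counter * 0.01, -77.0)
  }
}

class SignalSpec extends AnyFunSuite with TestData {
  test("generated signals have distinct device ids") {
    val a = nextSignal()
    val b = nextSignal()
    assert(a.deviceId != b.deviceId)
  }
}
```
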
  18. HIRING / GOING FORWARD
      • To drive home the fact that we are no longer a Java-only shop, we have modified our job listings to include Scala as a preferred language.
      • Evaluating candidates' Scala skills was challenging at first, as we were novices ourselves.
      • As we continue to ramp up on Scala, we have started to branch out from using it only for Spark to using it for web services (the Play Framework) and to replace some of our legacy utility libraries.
      • We think we are now better positioned to quickly take advantage of new features coming down the Spark pipeline.
  20. Extra: Scala Likes (a few sketched below)
      • Greatly streamlined syntax
      • Easier use with Spark
        • Easy, fast serialization of case classes during shuffles
        • Built-in Product-type encoders
      • Built-in tuple types
      • Built-in anonymous functions
      • Options instead of nulls
      • Pattern matching instead of switch statements
      • IntelliJ Scala support
      • Simpler Futures
      • "Duck typing" (structural types)
      • Advanced reflection
      • Functional exception handling
      • Syntactic sugar
      • Lots of helpers: Option, Try, Success, Failure, Either, etc.
      • Everything is a function => more flexibility
      • Easier generics (less type-erasure pain)
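
A small self-contained illustration of three of these likes working together: Option instead of null, pattern matching instead of a switch, and Try for functional error handling:

```scala
import scala.util.{Failure, Success, Try}

object LikesSketch extends App {
  // Option instead of null: absence is explicit in the type
  val maybePort: Option[Int] =
    sys.env.get("PORT").flatMap(p => Try(p.toInt).toOption)

  // Pattern matching instead of a switch statement
  val portMessage = maybePort match {
    case Some(p) if p > 0 => s"listening on $p"
    case Some(_)          => "invalid port"
    case None             => "no port configured"
  }

  // Try instead of try/catch control flow
  Try("42".toInt) match {
    case Success(n) => println(s"$portMessage; parsed $n")
    case Failure(e) => println(s"parse failed: ${e.getMessage}")
  }
}
```
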
  21. Extra: Scala Dislikes
      • Untyped vals
      • Lots of special symbols
      • Library complexity
        • Akka and Typesafe libraries
        • JSON parsing libraries (incompatibility with Gson; complex Scala libs)
      • Java compatibility
        • Companion-object wrapping
        • Bean serialization
      • Defaulting to Seq for ordered collections (instead of the ideal data structure for the job)
      • Gradle vs. SBT
      • Overuse of implicit "magic"
      • Difficult learning curve (lots to learn!!)
      • Too much flexibility can create inconsistent and confusing code
      • Opaque compilation errors
      • Missing named tuples (as in, e.g., Python)
      • Enumerations are broken (illustrated below)
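
On "Enumerations are broken": scala.Enumeration values all share a single runtime type, so the compiler cannot check match exhaustiveness over them, which is why many teams reach for a sealed trait instead. A quick illustration (types made up):

```scala
// scala.Enumeration: both values are just Color.Value at runtime,
// so a non-exhaustive match over them compiles without warning
object Color extends Enumeration {
  val Red, Green = Value
}

// Sealed-trait alternative: exhaustiveness is checked at compile time
sealed trait Status
case object Active extends Status
case object Retired extends Status

object EnumSketch extends App {
  def describe(s: Status): String = s match {
    case Active  => "active"
    case Retired => "retired"
    // omitting a case here would produce a compiler warning
  }
  println(describe(Active))
  println(Color.Red)
}
```
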
  22. Extra: Scala Challenges
      • Immutable types instead of mutable types
      • Collection syntax sugar
        • Chaining functions causes lots of type headaches
      • Syntactic sugar
      • Using recursion (with @tailrec) instead of procedural loops (sketched below)
      • Pattern matching
      • Using small functions to keep code readable
      • Reflection, type tags, and class tags
      • Curried functions
      • Partial functions
      • Unfamiliar type system
      • OO paradigms don't translate well (we have to research the correct way of doing things)
      • Lots to learn!!
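
A sketch of the @tailrec shift: the same sum written first as the procedural loop a Java habit reaches for, then as a tail-recursive function, where the annotation makes the compiler verify the recursion compiles down to a loop (no stack growth):

```scala
import scala.annotation.tailrec

object TailRecSketch extends App {
  // Procedural style: mutable accumulator and a loop
  def sumLoop(xs: List[Long]): Long = {
    var total = 0L
    for (x <- xs) total += x
    total
  }

  // Tail-recursive style; @tailrec fails compilation if the
  // recursive call is not actually in tail position
  @tailrec
  def sumRec(xs: List[Long], acc: Long = 0L): Long = xs match {
    case Nil     => acc
    case h :: tl => sumRec(tl, acc + h)
  }

  println(sumLoop(List(1L, 2L, 3L)))
  println(sumRec(List(1L, 2L, 3L)))
}
```
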
  23. Aaron Perrin, Senior Software Developer. 703-840-8850