
Scala - THE language for Big Data



The Big Data ecosystem has matured. Concepts such as eventual consistency, immutability and the CAP theorem have been researched and successfully implemented in various Big Data tools and systems. In recent years, characteristics of Big Data systems have started infiltrating the lower levels of design, all the way down to the choice of language. In this light, Scala - the object-oriented, strongly-typed, functional language - has started to shine as a perfect fit for this environment, with tools like Apache Spark attesting to its benefits. In this talk I'll share my view of why Scala is THE language for Big Data processing, with some real-world examples of the advantages this combination creates.

Full version (with animations):

Published in: Software


  1. private static class Person { String firstName; String lastName; }

     private List<Person> firstNFamilies(int n, List<Person> persons) {
       final List<String> familiesSoFar = new LinkedList<>();
       final List<Person> result = new LinkedList<>();
       for (Person p : persons) {
         if (familiesSoFar.contains(p.lastName)) {
           result.add(p);
         } else if (familiesSoFar.size() < n) {
           familiesSoFar.add(p.lastName);
           result.add(p);
         }
       }
       return result;
     }

     case class Person(firstName: String, lastName: String)

     def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
       val firstFamilies = => p.lastName).distinct.take(n)
       persons.filter(p => firstFamilies.contains(p.lastName))
     }
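     The Scala version on this slide is short enough to run as-is; a self-contained sketch with invented sample names:

     ```scala
     case class Person(firstName: String, lastName: String)

     // Keep every person belonging to the first n distinct families encountered
     def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
       val firstFamilies = => p.lastName).distinct.take(n)
       persons.filter(p => firstFamilies.contains(p.lastName))
     }

     val people = List(
       Person("Ada", "Lovelace"),
       Person("Alan", "Turing"),
       Person("Grace", "Hopper"),
       Person("Charles", "Lovelace"))

     // First 2 families are Lovelace and Turing, so Grace Hopper is dropped
     firstNFamilies(2, people).map(_.firstName)  // List(Ada, Alan, Charles)
     ```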
  2. class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)
       extends ParquetOutputCommitter(outputPath, context) { … }

     ParquetOutputCommitter: Java class from org.apache.parquet:parquet-hadoop
     DirectParquetOutputCommitter: Scala class from org.apache.spark:spark-core_2.10
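     The point of this slide is that a Scala class can extend a Java class directly, with no bridging layer. A minimal hypothetical sketch of the same idea, using a JDK type instead of the Parquet class:

     ```scala
     import java.util.concurrent.Callable

     // Scala compiles to JVM bytecode, so Java types are ordinary supertypes -
     // just as DirectParquetOutputCommitter extends the Java ParquetOutputCommitter.
     class Answer extends Callable[Int] {
       override def call(): Int = 42
     }

     new Answer().call()  // 42
     ```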
  3. Nonsense! No Way! RAGE!!11
  4. (image-only slide)
  5. (image-only slide)
  6. val numbers = 1 to 100000
     val result =
  7. val numbers = 1 to 100000
     val result =
     Parallelizes next manipulations over available CPUs
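     The expression elided by the slide animation is presumably Scala's parallel collections; a sketch, with `slowF` as a stand-in for an expensive function:

     ```scala
     val numbers = 1 to 100000

     def slowF(x: Int): Int = x * 2  // stand-in for an expensive computation

     // .par switches to a parallel collection: the following map (and any
     // further transformations) are spread across the available CPU cores
     val result = numbers.par.map(slowF)
     ```

     Note that from Scala 2.13 on, `.par` lives in the separate scala-parallel-collections module; in the 2.10-2.12 era of this talk it was part of the standard library.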
  8. val numbers = 1 to 100000
     val result = sparkContext.parallelize(numbers).map(slowF)
     Parallelizes next manipulations over a scalable cluster, by creating a Spark RDD - a Resilient Distributed Dataset
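     A self-contained local-mode sketch of the same line; the app name, master URL and `slowF` are placeholders, not from the original slides:

     ```scala
     import org.apache.spark.{SparkConf, SparkContext}

     // local[*] runs Spark in-process, one worker thread per core
     val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
     val sparkContext = new SparkContext(conf)

     val numbers = 1 to 100000
     def slowF(x: Int): Int = x * 2  // stand-in for an expensive computation

     // parallelize() turns a local collection into an RDD (Resilient Distributed
     // Dataset); map(slowF) is then executed across the cluster's executors
     val result = sparkContext.parallelize(numbers).map(slowF)
     println(result.count())

     sparkContext.stop()
     ```

     The same `map(slowF)` call site thus scales from one machine's cores (`.par`) to a cluster (RDD) with almost no code change, which is the argument this slide sequence builds.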
  9. photo:
  10. (diagram: parallel Map tasks, with one task retried)