
Scala - THE language for Big Data


The Big Data ecosystem has matured. Concepts such as eventual consistency, immutability, the CAP theorem and many more have been researched and successfully implemented in a variety of Big Data tools and systems. In recent years, the characteristics of Big Data systems have started infiltrating the lower levels of design, all the way down to the choice of language. In this light, Scala - an object-oriented, strongly typed, functional language - has started to shine as a perfect fit for this environment, with tools like Apache Spark attesting to its benefits. In this talk I'll share my view of why Scala is THE language for Big Data processing, with some real-world examples of the advantages this combination creates.

Full version (with animations): https://docs.google.com/presentation/d/1m4_BBXQKbkaGWImFRwa33HFVEuTcOdwa4kY80dVpvAg



  1. Java:

     private static class Person { String firstName; String lastName; }

     // Keep only the people belonging to the first n distinct families,
     // preserving the original order.
     private List<Person> firstNFamilies(int n, List<Person> persons) {
       final List<String> familiesSoFar = new LinkedList<>();
       final List<Person> result = new LinkedList<>();
       for (Person p : persons) {
         if (familiesSoFar.contains(p.lastName)) {
           result.add(p);
         } else if (familiesSoFar.size() < n) {
           familiesSoFar.add(p.lastName);
           result.add(p);
         }
       }
       return result;
     }

     The same logic in Scala:

     case class Person(firstName: String, lastName: String)

     def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
       val firstFamilies = persons.map(p => p.lastName).distinct.take(n)
       persons.filter(p => firstFamilies.contains(p.lastName))
     }

     (A usage sketch follows the transcript.)
  2. A Scala class from org.apache.spark:spark-core_2.10, extending a Java class
     from org.apache.parquet:parquet-hadoop:

     class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)
       extends ParquetOutputCommitter(outputPath, context) { … }

     (See the interop sketch after the transcript.)
  3. 3. Nonsense! No Way! RAGE!!11
  4. (image: scale up vs. scale out, http://vmturbo.com/wp-content/uploads/2015/05/ScaleUpScaleOut_sm-min.jpg)
  5. (same image as slide 4)
  6. val numbers = 1 to 100000
     val result = numbers.map(slowF)
  7. val numbers = 1 to 100000
     val result = numbers.par.map(slowF)

     Parallelizes the subsequent operations across the available CPU cores.
  8. val numbers = 1 to 100000
     val result = sparkContext.parallelize(numbers).map(slowF)

     Parallelizes the subsequent operations across a scalable cluster by creating a Spark RDD, a Resilient Distributed Dataset. (A combined runnable sketch of slides 6-8 follows the transcript.)
  9. 9. photo: http://www.swissict-award.ch/fileadmin/award/Pressebilder/Martin_Odersky_Scala.jpg
  10. (diagram: five parallel Map tasks, one of them marked "(retry)"; see the note after the transcript)
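
A usage sketch for the slide-1 Scala version, assuming its Person and firstNFamilies definitions are in scope (the sample data and the printed result are illustrative, not from the deck):

  val people = List(
    Person("Ada", "Lovelace"),
    Person("Alan", "Turing"),
    Person("Grace", "Hopper"),
    Person("Dermot", "Turing"))

  // With n = 2, the first two distinct families are Lovelace and Turing,
  // so Hopper is filtered out while both Turings are kept.
  firstNFamilies(2, people)
  // => List(Person(Ada,Lovelace), Person(Alan,Turing), Person(Dermot,Turing))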
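
The point of slide 2 is that Scala compiles to ordinary JVM bytecode, so a Scala class can extend a Java class (or implement a Java interface) with no glue code. A minimal self-contained sketch of the same idea, using java.lang.Runnable instead of the Parquet API:

  // A Scala class implementing a plain Java interface.
  class Greeter(name: String) extends Runnable {
    override def run(): Unit = println(s"Hello, $name")
  }

  // Passed directly to a Java API that expects a Runnable.
  new Thread(new Greeter("Big Data")).start()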
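
Slides 6-8 combined into one runnable sketch. slowF is never defined in the deck, so the simulated-work version below is an assumption, as is the local-mode SparkContext setup:

  import org.apache.spark.{SparkConf, SparkContext}

  object ParallelismDemo extends App {
    // Hypothetical stand-in for the deck's slowF: any expensive pure function.
    def slowF(n: Int): Int = { Thread.sleep(1); n * n }

    val numbers = 1 to 100000

    // Slide 6 - sequential: one element at a time, on a single core.
    val sequential = numbers.map(slowF)

    // Slide 7 - parallel collections: the same code, spread across local CPU cores.
    // (Built into 2.10-era Scala; a separate module since Scala 2.13.)
    val multiCore = numbers.par.map(slowF)

    // Slide 8 - Spark: the same code again, spread across a cluster as an RDD.
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
    val distributed = sc.parallelize(numbers).map(slowF).collect()
    sc.stop()
  }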
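
My reading of the slide-10 diagram, in line with the abstract's emphasis on immutability: a map over immutable data is a pure function, so a failed task can simply be re-run with no risk of duplicated effects. A sketch of the distinction (both functions are illustrative, not from the deck):

  import java.nio.file.{Files, Paths, StandardOpenOption}

  // Pure: the same input always yields the same output, with no side effects,
  // so a framework is free to retry this on another node after a failure.
  def parse(line: String): Array[String] = line.split(",")

  // Impure: a retried task would append the audit line a second time.
  def parseAndLog(line: String): Array[String] = {
    Files.write(Paths.get("audit.log"), (line + "\n").getBytes,
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
    line.split(",")
  }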
