
An introduction to Apache Spark



An introduction to Apache Spark: what it is, how it
works, why you would use it, and some examples of
its use.

Published in: Technology, Education


  1. Apache Spark
     ● What is it?
     ● How does it work?
     ● Benefits
     ● Tuning
     ● Examples
     www.semtech-solutions.co.nz
     info@semtech-solutions.co.nz
  2. Spark – What is it?
     ● Open source
     ● An alternative to MapReduce for certain applications
     ● A low-latency cluster computing system
     ● For very large data sets
     ● May be up to 100 times faster than MapReduce for
       – Iterative algorithms
       – Interactive data mining
     ● Used with Hadoop / HDFS
     ● Released under the BSD licence
  3. Spark – How does it work?
     ● Uses in-memory cluster computing
     ● Memory access is faster than disk access
     ● Has APIs written in
       – Scala
       – Java
       – Python
     ● Can be accessed from the Scala and Python shells
     ● Currently an Apache incubator project
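The shells mentioned above expose the same transformation style on RDDs that ordinary Scala collections use. As a rough illustration only (plain Scala, no Spark installation required; `ShellStyleSketch` and the sample lines are invented for this sketch, with a `List` standing in for an RDD):

```scala
// Plain-Scala sketch of the filter/count pattern the Spark shells expose
// on RDDs; a List stands in for the RDD here, so no cluster is needed.
object ShellStyleSketch {
  def main(args: Array[String]): Unit = {
    val lines = List("error: disk full", "info: all good", "error: timeout")
    // Same shape as logData.filter(...).count() on a real RDD
    val numErrors = lines.filter(line => line.contains("error")).size
    println("Lines with error: %s".format(numErrors))
  }
}
```

On an actual RDD the work would be distributed across the cluster, but the code a shell user types looks essentially the same.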
  4. Spark – Benefits
     ● Scales to very large clusters
     ● Uses in-memory processing for increased speed
     ● High-level APIs – Java, Scala, Python
     ● Low-latency shell access
  5. Spark – Tuning
     ● Bottlenecks can occur in the cluster via
       – CPU, memory or network bandwidth
     ● Tune the data serialization method, e.g.
       – Java ObjectOutputStream vs Kryo
     ● Memory tuning
       – Use primitive types
       – Set JVM flags
       – Store objects in serialized form, e.g.
         ● RDD persistence
         ● MEMORY_ONLY_SER
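The serialization and persistence settings above can be sketched in code. This is an assumption-laden sketch, not a recipe: the exact property names and package paths vary between Spark versions, and `logData` refers to the RDD from the examples slide.

```scala
// Sketch only: property names / packages differ across Spark versions.
// In incubator-era releases the serializer was chosen via a system
// property set before the SparkContext was created.
System.setProperty("spark.serializer", "spark.KryoSerializer")

// Store an RDD in serialized form to reduce memory pressure,
// trading some CPU (deserialization) for space:
import spark.storage.StorageLevel
val cached = logData.persist(StorageLevel.MEMORY_ONLY_SER)
```

Serialized storage keeps each partition as one large byte array, which also eases garbage-collection pressure at the cost of deserializing on access.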
  6. Spark – Examples
     An example from spark-project.org: a Spark job in Scala showing a
     simple text count from a system log.

     /*** SimpleJob.scala ***/
     import spark.SparkContext
     import SparkContext._

     object SimpleJob {
       def main(args: Array[String]) {
         val logFile = "/var/log/syslog" // Should be some file on your system
         val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
           List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
         val logData = sc.textFile(logFile, 2).cache()
         val numAs = logData.filter(line => line.contains("a")).count()
         val numBs = logData.filter(line => line.contains("b")).count()
         println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
       }
     }
     www.semtech-solutions.co.nz
     info@semtech-solutions.co.nz
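The jar name passed to the SparkContext constructor implies the job was packaged with sbt. A build file along the lines of that era's quick start might look as follows; the dependency coordinates and version number here are assumptions, not taken from the slides, and should be checked against the Spark release in use.

```scala
// simple.sbt — assumed build configuration for the SimpleJob example
name := "Simple Project"

version := "1.0"

scalaVersion := "2.9.3"

// Coordinates/version are an assumption; match them to your Spark release
libraryDependencies += "org.spark-project" %% "spark-core" % "0.7.3"
```

With this in place, `sbt package` would produce the jar referenced above and `sbt run` would launch the job locally.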
  7. Contact Us
     ● Feel free to contact us at
       – www.semtech-solutions.co.nz
       – info@semtech-solutions.co.nz
     ● We offer IT project consultancy
     ● We are happy to hear about your problems
     ● You pay only for the hours that you need to solve your problems
