An introduction to Apache Spark

An introduction to Apache Spark: what it is, how it works,
why you might use it, and some examples of use.

Published in: Technology, Education

Transcript

  • 1. Apache Spark
      ● What is it?
      ● How does it work?
      ● Benefits
      ● Tuning
      ● Examples
      www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  • 2. Spark – What is it?
      ● Open source
      ● An alternative to MapReduce for certain applications
      ● A low-latency cluster computing system
      ● Designed for very large data sets
      ● May be up to 100 times faster than MapReduce for
          – Iterative algorithms
          – Interactive data mining
      ● Used with Hadoop / HDFS
      ● Released under the BSD License at the time of writing (later Apache releases use the Apache License 2.0)
  • 3. Spark – How does it work?
      ● Uses in-memory cluster computing
      ● Memory access is faster than disk access, so data that is reused is kept in memory
      ● Has APIs written in
          – Scala
          – Java
          – Python
      ● Can be accessed from Scala and Python shells
      ● An Apache Incubator project at the time of writing
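The in-memory idea on slide 3 can be sketched as follows. This is a minimal illustration, not code from the slides: it assumes a later Apache release of Spark (the `org.apache.spark` package and `SparkConf`), and the object name `IterativeCacheSketch`, the function `fitMean`, and the sample data are hypothetical. Each loop iteration makes a full pass over the same data set, which is the access pattern where caching in memory pays off.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IterativeCacheSketch {
  // Estimate the mean of `points` by gradient descent on the squared error.
  // Every iteration scans the whole RDD, so the caller should cache it.
  def fitMean(points: RDD[Double], iterations: Int, step: Double): Double = {
    val n = points.count()
    var m = 0.0
    for (_ <- 1 to iterations) {
      val current = m // copy into a val so the closure captures a stable value
      val gradient = points.map(x => current - x).sum() / n
      m -= step * gradient
    }
    m
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark inside this JVM with two threads (assumption: no cluster).
    val conf = new SparkConf().setAppName("iterative-cache-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // cache() keeps the data in memory after the first pass.
      val points = sc.parallelize((1 to 100).map(_.toDouble)).cache()
      println("Estimated mean: " + fitMean(points, 50, 0.5))
    } finally {
      sc.stop()
    }
  }
}
```

Without the `cache()` call the job still produces the same result, but each iteration would recompute the data set from its source.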
  • 4. Spark – Benefits
      ● Scales to very large clusters
      ● Uses in-memory processing for increased speed
      ● High-level APIs
          – Java, Scala, Python
      ● Low-latency shell access
  • 5. Spark – Tuning
      ● Bottlenecks can occur in the cluster via
          – CPU, memory or network bandwidth
      ● Tune the data serialization method, e.g.
          – Java ObjectOutputStream vs Kryo
      ● Memory tuning
          – Use primitive types
          – Set JVM flags
          – Store objects in serialized form, e.g.
              ● RDD persistence with the MEMORY_ONLY_SER storage level
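Two of the tuning options on slide 5, Kryo serialization and serialized in-memory persistence, might be set up as in the sketch below. It assumes a later Apache release of Spark where configuration goes through `SparkConf`; the object name `TuningSketch`, the helper names `kryoConf` and `persistSerialized`, and the sample data are hypothetical and not from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object TuningSketch {
  // Build a configuration that swaps Java serialization for Kryo.
  def kryoConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .setMaster("local[2]") // assumption: local run, no cluster
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  // Keep the RDD in memory as serialized bytes rather than as Java objects.
  // This trades some CPU time on access for a smaller memory footprint.
  def persistSerialized[T](rdd: RDD[T]): RDD[T] =
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(kryoConf("tuning-sketch"))
    try {
      val pairs = persistSerialized(sc.parallelize(1 to 100000).map(i => (i, i.toString)))
      println("Persisted records: " + pairs.count())
    } finally {
      sc.stop()
    }
  }
}
```

Whether Kryo or serialized storage helps depends on the workload, so it is worth measuring both settings on real data before adopting them.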
  • 6. Spark – Examples
      Example from spark-project.org: a Spark job in Scala that counts the lines of a system log containing the letters "a" and "b".

          /*** SimpleJob.scala ***/
          import spark.SparkContext
          import SparkContext._

          object SimpleJob {
            def main(args: Array[String]) {
              val logFile = "/var/log/syslog" // Should be some file on your system
              // Arguments: master, job name, Spark home, jars to ship to the workers
              val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
                List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
              // Read the file in 2 partitions and cache it, since it is scanned twice
              val logData = sc.textFile(logFile, 2).cache()
              val numAs = logData.filter(line => line.contains("a")).count()
              val numBs = logData.filter(line => line.contains("b")).count()
              println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
            }
          }
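The example on slide 6 imports the pre-Apache `spark` package and targets Scala 2.9.3. As a rough sketch, the same job written against the `org.apache.spark` package of later Apache releases could look like this. The object name `SimpleJobSketch` and the helper `countContaining` are hypothetical, and the log path is only a placeholder for any readable text file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SimpleJobSketch {
  // Count the lines that contain the given substring.
  def countContaining(lines: RDD[String], needle: String): Long =
    lines.filter(_.contains(needle)).count()

  def main(args: Array[String]): Unit = {
    val logFile = "/var/log/syslog" // assumption: should be some file on your system
    val conf = new SparkConf().setAppName("Simple Job").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Cache the file contents because they are scanned twice below.
      val logData = sc.textFile(logFile, 2).cache()
      val numAs = countContaining(logData, "a")
      val numBs = countContaining(logData, "b")
      println(s"Lines with a: $numAs, Lines with b: $numBs")
    } finally {
      sc.stop()
    }
  }
}
```

Here the master is set in code for a local run; when a job is launched through a cluster's submission tooling, the master is usually supplied from outside the program instead.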
  • 7. Contact Us
      ● Feel free to contact us at
          – www.semtech-solutions.co.nz
          – info@semtech-solutions.co.nz
      ● We offer IT project consultancy
      ● We are happy to hear about your problems
      ● You pay only for the hours you need to solve your problems
