Big Data Analytics with Cassandra, Spark & MLLib
Matthias Niehoff, codecentric AG

AGENDA
– Spark
  • Basics
  • In a Cluster
  • Cassandra Spark Connector
  • Use Cases
– Spark Streaming
– Spark SQL
– Spark MLLib
– Live Demo

CQL – QUERYING LANGUAGE WITH LIMITATIONS
SELECT * FROM performer WHERE name = 'ACDC'
→ ok
SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia'
→ not ok
SELECT country, COUNT(*) AS quantity FROM artists GROUP BY country ORDER BY quantity DESC
→ not possible
Schema: performer (name text PRIMARY KEY, style text, country text, type text)

WHAT IS APACHE SPARK?
– Open source & Apache project since 2010
– Data processing framework
  • Batch processing
  • Stream processing

WHY USE SPARK?
– Fast
  • up to 100 times faster than Hadoop
  • a lot of in-memory processing
  • scales out across nodes
– Easy
  • Scala, Java and Python APIs
  • clean code (e.g. with lambdas in Java 8)
  • extensive API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count, ...
– Fault-tolerant
  • easily reproducible

EASILY REPRODUCIBLE?
– RDDs – Resilient Distributed Datasets
  • read-only collections of objects
  • distributed across the cluster (in memory or on disk)
  • defined through transformations
  • automatically rebuilt on failure
– Operations
  • Transformations (map, filter, reduce, ...) → new RDD
  • Actions (count, collect, save)
– Only actions start processing!

RDD EXAMPLE
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
scala> linesWithSpark.count()
res0: Long = 126

REPRODUCE RDD USING A TREE
[Lineage diagram: a data source is transformed via map(..), filter(..), sample(..), union(..) and cache() into intermediate RDDs (rdd1–rdd6); count() actions produce the values val1–val3]

SPARK ON CASSANDRA
– Spark Cassandra Connector by DataStax
  • https://github.com/datastax/spark-cassandra-connector
– Cassandra tables as Spark RDDs (read & write)
– Mapping of C* tables and rows onto Java/Scala objects
– Server-side filtering ("where")
– Compatible with
  • Spark ≥ 0.9
  • Cassandra ≥ 2.0
– Clone & compile with SBT or download from Maven Central

USE SPARK CONNECTOR
– Start the Spark shell
bin/spark-shell
  --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar
  --conf spark.cassandra.connection.host=localhost
  --driver-memory 2g
  --executor-memory 3g
– Import the Cassandra classes
scala> import com.datastax.spark.connector._

READ TABLE
– Read a complete table
val movies = sc.cassandraTable("movie", "movies")
// Return type: CassandraRDD[CassandraRow]
– Read selected columns
val movies = sc.cassandraTable("movie", "movies").select("title","year")
– Filter rows (server-side; possible on primary-key and indexed columns)
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")

READ TABLE
– Access columns in the result set
movies.collect.foreach(r => println(r.get[String]("title")))
// also: getString, getStringOption, getOrElse, getMap, ...
– As Scala tuple
sc.cassandraTable[(String,Int)]("movie","movies")
  .select("title","year")
or
sc.cassandraTable("movie","movies").select("title","year")
  .as((_: String, _: Int))
→ CassandraRDD[(String,Int)]

READ TABLE – CASE CLASS
case class Movie(name: String, year: Int)
sc.cassandraTable[Movie]("movie","movies")
  .select("title","year")
sc.cassandraTable("movie","movies").select("title","year")
  .as(Movie)
→ CassandraRDD[Movie]

WRITE TABLE
– Every RDD can be saved (not only CassandraRDDs)
– Tuples
val tuples = sc.parallelize(Seq(("Hobbit",2012), ("96 Hours",2008)))
tuples.saveToCassandra("movie","movies", SomeColumns("title","year"))
– Case classes
case class Movie(title: String, year: Int)
val objects = sc.parallelize(Seq(Movie("Hobbit",2012), Movie("96 Hours",2008)))
objects.saveToCassandra("movie","movies")

PAIR RDDS
(key, value) – key and value may each be an atomic value, a collection, or an object; keys are not necessarily unique
val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))
val pairRDD = sc.parallelize(fluege)
pairRDD.filter(_._1 == "Thomas")
  .collect
  .foreach(t => println(t._1 + " flew to " + t._2))

WHY USE PAIR RDDS?
– Parallelization!
  • keys are used for partitioning
  • pairs with different keys are distributed across the cluster
– Efficient processing (see the sketch below) of
  • aggregate by key
  • group by key
  • sort by key
  • joins, unions based on keys

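A quick sketch of the "aggregate by key" case, reusing the flight data from the previous slide: reduceByKey pre-aggregates on each partition before shuffling, so only partial counts travel across the network.

val flights = sc.parallelize(Seq(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid")))
// count flights per person; partial sums are combined locally, then merged across partitions
flights.mapValues(_ => 1).reduceByKey(_ + _).collect.foreach(println)
// => (Mark,1), (Thomas,2)
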
SPECIAL OPS FOR PAIR RDDS
– keys, values
– mapValues(func), flatMapValues(func)
– lookup(key), collectAsMap(), countByKey()
– reduceByKey(func), foldByKey(zeroValue)(func)
– groupByKey(), cogroup(otherDataset)
– sortByKey([ascending])
– join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset)
– union(otherDataset), subtractByKey(otherDataset)

CASSANDRA EXAMPLE
val pairRDD = sc.cassandraTable("movie","director")
  .map(r => (r.getString("country"), r))
// directors per country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_ + _)
  .sortBy(-_._2).collect.foreach(println)
// or, unsorted
pairRDD.countByKey().foreach(println)
// all countries
pairRDD.keys.distinct
Schema: director (name text PRIMARY KEY, country text)

CASSANDRA EXAMPLE
pairRDD.groupByKey()  // group directors by country
→ RDD[(String, Iterable[CassandraRow])]
val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
directors.cogroup(movies)
→ RDD[(String, (Iterable[CassandraRow], Iterable[CassandraRow]))]
Schemas: director (name text PRIMARY KEY, country text) – movie (title text PRIMARY KEY, director text); each director grouped with his movies

CASSANDRA EXAMPLE – JOINS
– Joins can be expensive
  • partitions for the same keys in different tables may sit on different nodes
  • requires shuffling
val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
movies.join(directors)
→ RDD[(String, (CassandraRow, CassandraRow))]
Schemas: director (name text PRIMARY KEY, country text) – movie (title text PRIMARY KEY, director text); each movie joined with its director

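Newer versions of the connector (≥ 1.2) also offer joinWithCassandraTable, which fetches the matching rows directly from Cassandra instead of shuffling two full RDDs; a minimal sketch, assuming the director table lives in a keyspace named movie:

import com.datastax.spark.connector._
// for each movie, look up its director row via the partition key – no full-table shuffle
val moviesWithDirectors = movies.map { case (director, row) => Tuple1(director) }
  .joinWithCassandraTable("movie", "director")
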
USE CASES
– Data Loading
– Validation & Normalization
– Analyses (joins, transformations, ...)
– Schema Migration
– Data Conversion

DATA LOADING
– In particular for huge amounts of external data
– Support for CSV, TSV, XML, JSON and others
– Example:
case class User(id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match { case Array(id, name) => User(java.util.UUID.fromString(id), name) })
users.saveToCassandra("keyspace","users")

VALIDATION
– Validate consistency in a Cassandra database
  • syntactic
    • uniqueness (only relevant for columns not in the PK; see the sketch below)
    • referential integrity
    • integrity of duplicates
  • semantic
    • business or application constraints
    • e.g. at least one genre per movie, a maximum of 10 tags per blog post

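As an illustration of the uniqueness check, a minimal sketch with Spark, assuming a hypothetical users table whose email column is not part of the PK:

// group by the supposedly unique column and keep values occurring more than once
val duplicates = sc.cassandraTable("keyspace", "users")
  .map(r => (r.getString("email"), 1))
  .reduceByKey(_ + _)
  .filter { case (_, count) => count > 1 }
duplicates.collect.foreach(println)
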
ANALYSES
– Modelling, mining, transforming, ...
– Use cases
  • Recommendation
  • Fraud detection
  • Link analysis (social networks, web)
  • Advertising
  • Data stream analytics (→ Spark Streaming)
  • Machine learning (→ Spark ML)

SCHEMA MIGRATION & DATA CONVERSION
– Changes to existing tables
  • new table required when changing the primary key
  • other changes can be performed in place
– Creating new tables
  • data derived from existing tables
  • support new queries
– Use the CassandraConnector in Spark
val cc = CassandraConnector(sc.getConf)
cc.withSessionDo { session =>
  session.execute("ALTER TABLE movie.movies ADD year timestamp")
}

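A sketch of such a data conversion, deriving a new query table from an existing one (the target table movies_by_year is hypothetical):

// read the source table, re-key the data, write it into the new table
val movies = sc.cassandraTable("movie", "movies").select("title", "year")
movies.map(r => (r.getInt("year"), r.getString("title")))
  .saveToCassandra("movie", "movies_by_year", SomeColumns("year", "title"))
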
STREAM PROCESSING
... with Spark Streaming
– Real-time processing
– Supported sources: TCP, HDFS, S3, Kafka, Twitter, ...
– Data as discretized stream (DStream)
– All operations of the generic RDD
– Stateful operations
– Sliding windows

SPARK STREAMING – EXAMPLE
val ssc = new StreamingContext(sc, Seconds(1))
val stream = ssc.socketTextStream("127.0.0.1", 9999)
stream.map(x => 1).reduce(_ + _).print()
ssc.start()
// await manual termination or error
ssc.awaitTermination()
// manual termination
ssc.stop()

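The feature list above also mentions sliding windows; a minimal sketch on the same socket stream (window and slide durations are chosen arbitrarily):

// count events over the last 30 seconds, recomputed every 10 seconds
stream.map(x => 1)
  .reduceByWindow(_ + _, Seconds(30), Seconds(10))
  .print()
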
SPARK SQL
– SQL queries with Spark (SQL & HiveQL)
  • on structured data
  • on SchemaRDDs
  • every result of Spark SQL is a SchemaRDD
  • all operations of the generic RDD available
– Supported, also on non-PK columns:
  • Joins
  • Union
  • Group By
  • Having
  • Order By

SPARK SQL – CASSANDRA EXAMPLE
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")
val result = csc.sql("SELECT country, COUNT(*) AS anzahl " +
  "FROM artists GROUP BY country " +
  "ORDER BY anzahl DESC")
result.collect.foreach(println)
Schema: performer (name text PRIMARY KEY, style text, country text, type text)

SPARK SQL – RDD / JSON EXAMPLE
val sqlContext = new SQLContext(sc)
val personen = sqlContext.jsonFile(path)
// show the schema
personen.printSchema()
personen.registerTempTable("personen")
val erwachsene =
  sqlContext.sql("SELECT name FROM personen WHERE alter > 18")
erwachsene.collect.foreach(println)
Sample data:
{"name":"Michael"}
{"name":"Jan", "alter":30}
{"name":"Tim", "alter":17}

SPARK MLLIB
– Fully integrated in Spark
  • scalable
  • Scala, Java & Python APIs
  • usable with Spark Streaming & Spark SQL
– Packages various algorithms for machine learning
– Includes
  • Clustering
  • Classification
  • Prediction
  • Collaborative Filtering
– Still under development
  • performance, algorithms

EXAMPLE – CLUSTERING
[Figure: data points plotted by income vs. age – a set of data points on the left, meaningful clusters on the right]

EXAMPLE – CLUSTERING (K-MEANS)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(
  s.split(' ').map(_.toDouble))).cache()
// cluster the data into 2 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// evaluate the clustering by computing the Sum of Squared Errors
val SSE = clusters.computeCost(parsedData)
println("Sum of Squared Errors = " + SSE)

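To inspect the trained model, the learned centers are exposed on the KMeansModel:

// coordinates of the two learned cluster centers
clusters.clusterCenters.foreach(println)
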
EXAMPLE – CLASSIFICATION
[Image slides: illustration of supervised classification]

EXAMPLE – CLASSIFICATION (LINEAR SVM)
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils
// load training data in LIBSVM format
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
// split data into training (60%) and test (40%)
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// run the training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

EXAMPLE – CLASSIFICATION (LINEAR SVM)
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
// output raw scores instead of 0/1 predictions
model.clearThreshold()
// compute raw scores on the test set
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
// get evaluation metrics
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)

COLLABORATIVE FILTERING

COLLABORATIVE FILTERING (ALS)
import org.apache.spark.mllib.recommendation.{ALS, Rating}
// load and parse the data (userid,itemid,rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate)
  => Rating(user.toInt, item.toInt, rate.toDouble) })
// build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

COLLABORATIVE FILTERING (ALS)
// evaluate the model on the rating data
val usersProducts = ratings.map { case Rating(user, product, rate)
  => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate) }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate) }.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = (r1 - r2); err * err }.mean()
println("Mean Squared Error = " + MSE)

CURIOUS? GETTING STARTED!
– Setup, code & examples on GitHub
  • https://github.com/mniehoff/spark-cassandra-playground
– Install Cassandra
  • for local experiments: Cassandra Cluster Manager
    https://github.com/pcmanus/ccm
  • ccm create test -v binary:2.0.5 -n 3 -s
  • ccm status
  • ccm node1 status
– Install Spark
  • tarball (pre-built for Hadoop 2.6)
  • Homebrew
  • MacPorts

CURIOUS? GETTING STARTED!
– Spark Cassandra Connector
  • https://github.com/datastax/spark-cassandra-connector
  • clone, ./sbt/sbt assembly
  or
  • download the jar from Maven Central
– Start the Spark shell
spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost
– Import the Cassandra classes
scala> import com.datastax.spark.connector._

DATASTAX ENTERPRISE EDITION
– Production-ready Cassandra
– Integrates
  • Spark
  • Hadoop
  • Solr
– OpsCenter & DevCenter
– Bundled and ready to go
– Free for development and test
– http://www.datastax.com/download
  (one-time registration required)

QUESTIONS?

CONTACT
Matthias Niehoff
codecentric AG
Zeppelinstraße 2
76185 Karlsruhe
tel +49 (0) 721 – 95 95 681
matthias.niehoff@codecentric.de


Editor's Notes

  • #5 token = murmur3(partition key)
  • #9 stores the list of partitions
  • #12 tasks from different applications run in different JVMs; driver program close to the workers, e.g. in the same LAN
  • #13 GroupBy, ReduceBy, SQL joins
  • #14 Cassandra as datastore, Spark for data modelling & data analytics; data locality
  • #15 data locality, workload isolation
  • #16 Master: workers, running / completed apps. Worker: executors
  • #18 Application: add the jar, create the context
  • #19 Filtering rows is possible on primary-key and indexed columns
  • #24 Alternative: keyBy (rdd => key)
  • #27 toSeq.sortBy(-_._2) for sorting
  • #29 Join: n keys -> n rows; cogroup: same keys are grouped
  • #30 Data locality: partitions built from the data of the local C* node
  • #32 defaultParallelism: total number of cores on all nodes
  • #36 Interval: 500 ms up to a few seconds
  • #38 SchemaRDD: schema information, consists of rows
  • #39 Table of performers, PK is only the name
  • #40 Table of performers, PK is only the name
  • #42 unsupervised
  • #43 clusters.clusterCenters
  • #44 supervised
  • #45 supervised
  • #46 SVM = Support Vector Machine