Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

A talk about using Spark for data analytics on a Cassandra database.

  1. Big Data Analytics with Cassandra, Spark & MLLib – Matthias Niehoff, codecentric AG
  2. AGENDA
     – Spark
       • Basics
       • In a Cluster
       • Cassandra Spark Connector
       • Use Cases
     – Spark Streaming
     – Spark SQL
     – Spark MLLib
     – Live Demo
  3. CQL – A QUERY LANGUAGE WITH LIMITATIONS
     Table performer: name text (PK), style text, country text, type text

     SELECT * FROM performer WHERE name = 'ACDC'
     -- ok
     SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia'
     -- not ok (filter on a non-key column)
     SELECT country, COUNT(*) AS quantity FROM artists
     GROUP BY country ORDER BY quantity DESC
     -- not possible (CQL has no GROUP BY, and ORDER BY is restricted to clustering columns)
  4. WHAT IS APACHE SPARK?
     – Open source, an Apache project since 2010
     – Data processing framework
       • Batch processing
       • Stream processing
  5. WHY USE SPARK?
     – Fast
       • up to 100 times faster than Hadoop
       • a lot of in-memory processing
       • scales horizontally across nodes
     – Easy
       • Scala, Java and Python APIs
       • clean code (e.g. with lambdas in Java 8)
       • rich API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count, ...
     – Fault-tolerant
       • easily reproducible
  6. EASILY REPRODUCIBLE?
     – RDDs – Resilient Distributed Datasets
       • read-only collections of objects
       • distributed across the cluster (in memory or on disk)
       • defined through transformations
       • automatically rebuilt on failure
     – Operations
       • Transformations (map, filter, reduce, ...) → new RDD
       • Actions (count, collect, save)
     – Only actions start processing!
  7. RDD EXAMPLE
     scala> val textFile = sc.textFile("README.md")
     textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

     scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
     linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

     scala> linesWithSpark.count()
     res0: Long = 126
  8. REPRODUCE RDD USING A TREE
     [Diagram: a lineage tree from a data source through transformations
     (map, filter, union, sample, cache) down to actions (count) that
     produce values – lost partitions are recomputed along this tree.]
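     (Aside: a minimal sketch of what this lineage means in code, assuming a local sc
     and a data.txt input file – transformations only record the recipe, the action at
     the end triggers the work:)

     val source = sc.textFile("data.txt")               // data source
     val lowered = source.map(_.toLowerCase)            // transformation: recorded, not executed
     val filtered = lowered.filter(_.contains("spark")) // transformation: still lazy
     filtered.cache()                                   // keep this branch in memory after first use
     val n = filtered.count()                           // action: runs the whole chain
     // If a cached partition is lost later, Spark re-runs map and filter
     // for that partition only, following the recorded lineage tree.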
  9. SPARK ON CASSANDRA
     – Spark Cassandra Connector by DataStax
       • https://github.com/datastax/spark-cassandra-connector
     – Cassandra tables as Spark RDDs (read & write)
     – Maps C* tables and rows onto Java/Scala objects
     – Server-side filtering ("where")
     – Compatible with
       • Spark ≥ 0.9
       • Cassandra ≥ 2.0
     – Clone & compile with SBT, or download from Maven Central
  10. USE SPARK CONNECTOR
     – Start the Spark shell:
       bin/spark-shell
         --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar
         --conf spark.cassandra.connection.host=localhost
         --driver-memory 2g
         --executor-memory 3g
     – Import the Cassandra classes:
       scala> import com.datastax.spark.connector._
  11. READ TABLE
     – Read a complete table:
       val movies = sc.cassandraTable("movie", "movies")
       // Return type: CassandraRDD[CassandraRow]
     – Read selected columns:
       val movies = sc.cassandraTable("movie", "movies").select("title", "year")
     – Filter rows:
       val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
  12. READ TABLE
     – Access columns in the result set:
       movies.collect.foreach(r => println(r.get[String]("title")))
       // also: getString, getStringOption, getOrElse, getMap, ...
     – As Scala tuples:
       sc.cassandraTable[(String, Int)]("movie", "movies").select("title", "year")
       // or
       sc.cassandraTable("movie", "movies").select("title", "year")
         .as((_: String, _: Int))
       // Return type: CassandraRDD[(String, Int)]
  13. READ TABLE – CASE CLASS
     case class Movie(title: String, year: Int)

     sc.cassandraTable[Movie]("movie", "movies").select("title", "year")
     // or
     sc.cassandraTable("movie", "movies").select("title", "year").as(Movie)
     // Return type: CassandraRDD[Movie]
  14. WRITE TABLE
     – Every RDD can be saved (not only CassandraRDDs)
     – Tuples:
       val tuples = sc.parallelize(Seq(("Hobbit", 2012), ("96 Hours", 2008)))
       tuples.saveToCassandra("movie", "movies", SomeColumns("title", "year"))
     – Case classes:
       case class Movie(title: String, year: Int)
       val objects = sc.parallelize(Seq(Movie("Hobbit", 2012), Movie("96 Hours", 2008)))
       objects.saveToCassandra("movie", "movies")
  15. PAIR RDDS
     ( [atomic, collection, object] , [atomic, collection, object] )
       key (not unique)              value

     val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))
     val pairRDD = sc.parallelize(fluege)
     pairRDD.filter(_._1 == "Thomas")
       .collect
       .foreach(t => println(t._1 + " flog nach " + t._2)) // "flog nach" = "flew to"
  16. WHY USE PAIR RDDS?
     – Parallelization!
       • keys are used for partitioning
       • pairs with different keys are distributed across the cluster
     – Efficient processing of
       • aggregate by key
       • group by key
       • sort by key
       • joins and unions based on keys
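     (Aside: a small sketch of what "keys are used for partitioning" means in
     practice, assuming a local sc – partitionBy with a HashPartitioner co-locates
     equal keys, so the following reduceByKey needs no additional shuffle:)

     import org.apache.spark.HashPartitioner

     val pairs = sc.parallelize(Seq(("de", 1), ("fr", 1), ("de", 1)))
     val partitioned = pairs.partitionBy(new HashPartitioner(4)) // equal keys land in the same partition
     val counts = partitioned.reduceByKey(_ + _)                 // aggregates locally, no extra shuffle
     counts.collect.foreach(println)                             // (de,2), (fr,1)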
  17. SPECIAL OPS FOR PAIR RDDS
     – keys(), values()
     – mapValues(func), flatMapValues(func)
     – lookup(key), collectAsMap(), countByKey()
     – reduceByKey(func), foldByKey(zeroValue)(func)
     – groupByKey(), cogroup(otherDataset)
     – sortByKey([ascending])
     – join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset)
     – union(otherDataset), subtractByKey(otherDataset)
  18. CASSANDRA EXAMPLE
     Table director: name text (PK), country text

     val pairRDD = sc.cassandraTable("movie", "director")
       .map(r => (r.getString("country"), r))

     // Directors per country, sorted
     pairRDD.mapValues(v => 1).reduceByKey(_ + _)
       .sortBy(-_._2).collect.foreach(println)
     // or, unsorted
     pairRDD.countByKey().foreach(println)
     // All countries
     pairRDD.keys
  19. CASSANDRA EXAMPLE
     Tables: director (name text PK, country text), movie (title text PK, director text)

     // Group directors by country (the key)
     pairRDD.groupByKey()
     // RDD[(String, Iterable[CassandraRow])]

     val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
     val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
     directors.cogroup(movies)
     // RDD[(String, (Iterable[CassandraRow], Iterable[CassandraRow]))]
  20. CASSANDRA EXAMPLE – JOINS
     – Joins can be expensive
       • partitions for the same key live in different tables on different nodes
       • requires shuffling

     val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
     val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
     movies.join(directors)
     // RDD[(String, (CassandraRow, CassandraRow))]
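     (Aside, not from the slides: connector versions ≥ 1.2 also offer
     joinWithCassandraTable, which fetches matching Cassandra rows per key instead of
     reading and shuffling the whole second table. A hedged sketch against the
     movie/director schema above:)

     // Look up each movie's director directly in movie.director;
     // avoids the full read + shuffle of a generic RDD join.
     val directorKeys = sc.cassandraTable("movie", "movies")
       .map(r => Tuple1(r.getString("director")))
     val joined = directorKeys.joinWithCassandraTable("movie", "director")
     // joined: pairs of (Tuple1[String], CassandraRow)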
  21. USE CASES
     – Data Loading
     – Validation & Normalization
     – Analyses (joins, transformations, ...)
     – Schema Migration
     – Data Conversion
  22. DATA LOADING
     – In particular for huge amounts of external data
     – Support for CSV, TSV, XML, JSON and others
     – Example:
       case class User(id: java.util.UUID, name: String)
       val users = sc.textFile("users.csv")
         .repartition(2 * sc.defaultParallelism)
         .map(line => line.split(",") match {
           case Array(id, name) => User(java.util.UUID.fromString(id), name)
         })
       users.saveToCassandra("keyspace", "users")
  23. VALIDATION
     – Validate consistency in a Cassandra database
       • syntactic
         - uniqueness (only relevant for columns not in the PK)
         - referential integrity
         - integrity of denormalized duplicates
       • semantic
         - business or application constraints
         - e.g. at least one genre per movie, at most 10 tags per blog post
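     (Aside: the slide shows no code; a minimal sketch of one such check –
     referential integrity between the movie and director tables used earlier:)

     // Every movie.director should exist as a director.name
     val referenced = sc.cassandraTable("movie", "movies")
       .map(r => (r.getString("director"), r.getString("title")))
     val known = sc.cassandraTable("movie", "director")
       .map(r => (r.getString("name"), ()))
     val orphans = referenced.subtractByKey(known) // movies whose director is unknown
     orphans.collect.foreach { case (director, title) =>
       println(s"movie '$title' references unknown director '$director'")
     }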
  24. ANALYSES
     – Modelling, mining, transforming, ...
     – Use cases
       • recommendation
       • fraud detection
       • link analysis (social networks, web)
       • advertising
       • data stream analytics (→ Spark Streaming)
       • machine learning (→ Spark MLlib)
  25. SCHEMA MIGRATION & DATA CONVERSION
     – Changes to existing tables
       • a new table is required when changing the primary key
       • other changes can be performed in place
     – Creating new tables
       • data derived from existing tables
       • to support new queries
     – Use the CassandraConnector in Spark:
       val cc = CassandraConnector(sc.getConf)
       cc.withSessionDo { session =>
         session.execute("ALTER TABLE movie.movies ADD year timestamp")
       }
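     (Aside: for the "new table required when changing the primary key" case, the
     data conversion is itself a small Spark job – read the old table, reshape,
     write to the new one. A sketch; movies_by_year is a hypothetical target table
     with the new primary key:)

     // Copy movie.movies into a table keyed differently
     case class MovieByYear(year: Int, title: String)
     sc.cassandraTable[MovieByYear]("movie", "movies")
       .saveToCassandra("movie", "movies_by_year") // hypothetical target table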
  26. STREAM PROCESSING
     ... with Spark Streaming
     – Real-time processing
     – Supported sources: TCP, HDFS, S3, Kafka, Twitter, ...
     – Data as discretized streams (DStreams)
     – All operations of the generic RDD
     – Stateful operations
     – Sliding windows
  27. SPARK STREAMING – EXAMPLE
     val ssc = new StreamingContext(sc, Seconds(1))
     val stream = ssc.socketTextStream("127.0.0.1", 9999)
     stream.map(x => 1).reduce(_ + _).print()
     ssc.start()
     // await manual termination or error
     ssc.awaitTermination()
     // manual termination
     ssc.stop()
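     (Aside: the previous slide lists stateful operations and sliding windows without
     showing them; a hedged sketch of a windowed count over the same socket stream –
     a 10-second window recomputed every 2 seconds:)

     import org.apache.spark.streaming.{Seconds, StreamingContext}

     val ssc = new StreamingContext(sc, Seconds(1))
     ssc.checkpoint("checkpoint") // required for the incremental window variant
     val stream = ssc.socketTextStream("127.0.0.1", 9999)
     stream.map(x => 1)
       .reduceByWindow(_ + _, _ - _, Seconds(10), Seconds(2)) // add new batches, subtract expired ones
       .print()
     ssc.start()
     ssc.awaitTermination()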
  28. SPARK SQL
     – SQL queries with Spark (SQL & HiveQL)
       • on structured data
       • on SchemaRDDs
       • every Spark SQL result is a SchemaRDD
       • all operations of the generic RDDs are available
     – Supported, also on non-PK columns:
       • joins
       • union
       • group by
       • having
       • order by
  29. SPARK SQL – CASSANDRA EXAMPLE
     Table performer: name text (PK), style text, country text, type text

     val csc = new CassandraSQLContext(sc)
     csc.setKeyspace("musicdb")
     val result = csc.sql("SELECT country, COUNT(*) as anzahl " +
       "FROM artists GROUP BY country " +
       "ORDER BY anzahl DESC") // "anzahl" = "count"
     result.collect.foreach(println)
  30. SPARK SQL – RDD / JSON EXAMPLE
     // sample JSON input ("alter" = age):
     // {"name":"Michael"}
     // {"name":"Jan", "alter":30}
     // {"name":"Tim", "alter":17}

     val sqlContext = new SQLContext(sc)
     val personen = sqlContext.jsonFile(path)
     // Show the schema
     personen.printSchema()
     personen.registerTempTable("personen")
     val erwachsene = sqlContext.sql(
       "SELECT name FROM personen WHERE alter > 18") // "erwachsene" = "adults"
     erwachsene.collect.foreach(println)
  31. SPARK MLLIB
     – Fully integrated in Spark
       • scalable
       • Scala, Java & Python APIs
       • works with Spark Streaming & Spark SQL
     – Packages various algorithms for machine learning, including
       • clustering
       • classification
       • prediction
       • collaborative filtering
     – Still under development
       • performance, algorithms
  32. EXAMPLE – CLUSTERING
     [Diagram: a set of data points in the income/age plane, grouped into meaningful clusters.]
  33. EXAMPLE – CLUSTERING (K-MEANS)
     // Load and parse data
     val data = sc.textFile("data/mllib/kmeans_data.txt")
     val parsedData = data.map(s => Vectors.dense(
       s.split(' ').map(_.toDouble))).cache()

     // Cluster the data into 2 classes using KMeans with 20 iterations
     val clusters = KMeans.train(parsedData, 2, 20)

     // Evaluate clustering by computing the Sum of Squared Errors
     val SSE = clusters.computeCost(parsedData)
     println("Sum of Squared Errors = " + SSE)
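     (Aside: once trained, the model can also assign new points to clusters – a small
     follow-up sketch, assuming the three-dimensional sample data above:)

     import org.apache.spark.mllib.linalg.Vectors

     val point = Vectors.dense(0.5, 0.5, 0.5)
     val clusterId = clusters.predict(point) // index of the nearest learned cluster center
     println("Point " + point + " belongs to cluster " + clusterId)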
  34. EXAMPLE – CLASSIFICATION
     [Diagram only.]
  35. EXAMPLE – CLASSIFICATION
     [Diagram only.]
  36. EXAMPLE – CLASSIFICATION (LINEAR SVM)
     // Load training data in LIBSVM format.
     val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")

     // Split data into training (60%) and test (40%).
     val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
     val training = splits(0).cache()
     val test = splits(1)

     // Run the training algorithm to build the model
     val numIterations = 100
     val model = SVMWithSGD.train(training, numIterations)
  37. EXAMPLE – CLASSIFICATION (LINEAR SVM)
     // Compute raw scores on the test set.
     val scoreAndLabels = test.map { point =>
       val score = model.predict(point.features)
       (score, point.label)
     }

     // Get evaluation metrics.
     val metrics = new BinaryClassificationMetrics(scoreAndLabels)
     val auROC = metrics.areaUnderROC()
     println("Area under ROC = " + auROC)
  38. COLLABORATIVE FILTERING
     [Diagram only.]
  39. COLLABORATIVE FILTERING (ALS)
     // Load and parse the data (userid,itemid,rating)
     val data = sc.textFile("data/mllib/als/test.data")
     val ratings = data.map(_.split(',') match {
       case Array(user, item, rate) =>
         Rating(user.toInt, item.toInt, rate.toDouble)
     })

     // Build the recommendation model using ALS
     val rank = 10
     val numIterations = 20
     val model = ALS.train(ratings, rank, numIterations, 0.01)
  41. COLLABORATIVE FILTERING (ALS)
     // Evaluate the model on the rating data
     val usersProducts = ratings.map { case Rating(user, product, rate) =>
       (user, product)
     }
     val predictions = model.predict(usersProducts).map {
       case Rating(user, product, rate) => ((user, product), rate)
     }
     val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
       ((user, product), rate)
     }.join(predictions)
     val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
       val err = r1 - r2; err * err
     }.mean()
     println("Mean Squared Error = " + MSE)
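     (Aside: beyond the MSE, the trained model produces the actual recommendations.
     A brief sketch; recommendProducts exists on MLlib's MatrixFactorizationModel in
     this Spark generation, but take the exact signature as an assumption:)

     // Top 3 item recommendations for user 1
     model.recommendProducts(1, 3).foreach { case Rating(user, product, rate) =>
       println(s"recommend item $product to user $user (score $rate)")
     }
     // Or a single predicted rating for (user 1, item 2)
     val predicted = model.predict(1, 2)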
  42. CURIOUS? GETTING STARTED!
     – Setup, code & examples on GitHub
       • https://github.com/mniehoff/spark-cassandra-playground
     – Install Cassandra
       • for local experiments: Cassandra Cluster Manager
         https://github.com/pcmanus/ccm
       • ccm create test -v binary:2.0.5 -n 3 -s
       • ccm status
       • ccm node1 status
     – Install Spark
       • tarball (pre-built for Hadoop 2.6)
       • Homebrew
       • MacPorts
  43. CURIOUS? GETTING STARTED!
     – Spark Cassandra Connector
       • https://github.com/datastax/spark-cassandra-connector
       • clone and run ./sbt/sbt assembly, or
       • download the jar from Maven Central
     – Start the Spark shell:
       spark-shell
         --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar
         --conf spark.cassandra.connection.host=localhost
     – Import the Cassandra classes:
       scala> import com.datastax.spark.connector._
  44. DATASTAX ENTERPRISE EDITION
     – Production-ready Cassandra
     – Integrates
       • Spark
       • Hadoop
       • Solr
     – OpsCenter & DevCenter
     – Bundled and ready to go
     – Free for development and test
     – http://www.datastax.com/download (one-time registration required)
  45. QUESTIONS?
  46. CONTACT
     Matthias Niehoff
     codecentric AG
     Zeppelinstraße 2
     76185 Karlsruhe
     tel +49 (0) 721 – 95 95 681
     matthias.niehoff@codecentric.de
