Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cassandra Recommendations Machine Learning Graph Processing

1. Cassandra Summit 2015: Real-time Advanced Analytics with Spark and Cassandra
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center (spark.tc)
Sept 24, 2015
Power of data. Simplicity of design. Speed of innovation.
2. Who am I?
Streaming Platform Engineer (not a photographer or model)
Streaming Data Engineer: Netflix Open Source Committer
Data Solutions Engineer: Apache Contributor
Principal Data Solutions Engineer: IBM Spark Technology Center
3. Advanced Apache Spark Meetup (Organizer)
Total Spark experts: ~1,000
Mean RSVPs per meetup: ~300
Mean attendance: ~50-60% of RSVPs
I'm lucky to work for a company/boss that lets me do this full-time!
Come work with me! We'll kick ass!
4. Recent and Future Meetups
Spark-Cassandra Connector w/ Russell Spitzer (DataStax) & Me: Sept 21st, 2015 <-- Great turnout and interesting questions!
Project Tungsten Data Structs+Algos: CPU & Memory Optimizations: Nov 12th, 2015
Text-based Advanced Analytics and Machine Learning: Jan 14th, 2016
ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me: Feb 16th, 2016
Spark Internals Deep Dive: Mar 24th, 2016
Spark SQL Catalyst Optimizer Deep Dive: Apr 21st, 2016
5. Topics of this Talk
① Recommendations
② Live Interactive Demo!
③ DataFrames
④ Catalyst Optimizer and Query Plans
⑤ Data Sources API
⑥ Creating and Contributing a Custom Data Source
⑦ Partitions, Pruning, Pushdowns
⑧ Native + Third-Party Data Source Impls
⑨ Spark SQL Performance Tuning
6. Live, Interactive Demo!
Spark After Dark: Generating High-Quality Dating Recommendations
Real-time Advanced Analytics: Machine Learning, Graph Processing
7. Recommendations
Non-Personalized
  "Cold Start" problem
  Top K
  PageRank
Personalized
  User-User, User-Item, Item-Item Collaborative Filtering
8. Types of User Feedback
Explicit: ratings, likes
Implicit: searches, clicks, hovers, views, scrolls, pauses
Both kinds are used to train models for future recommendations.
9. Similarity
① Euclidean: linear measure; suffers from magnitude bias
② Cosine: angle measure; adjusts for magnitude bias
③ Jaccard: set intersection / union; suffers from popularity bias
④ Log Likelihood: adjusts for popularity bias
(The slide shows an example 0/1 user-item matrix with columns Ali, Matei, Reynold, Patrick, Andy and rows Kimberly, Leslie, Meredith, Lisa, Holden; Holden has interacted with all five.)
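As a minimal sketch (not from the talk), two of the measures above can be written in a few lines of plain Scala, operating on 0/1 interaction vectors like the user-item matrix on this slide:

```scala
// Cosine similarity over numeric vectors and Jaccard similarity over sets.
// Self-contained illustration; not Spark or MLlib code.
object Similarity {
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)            // angle-based, so vector magnitude cancels out
  }

  def jaccard(a: Set[Int], b: Set[Int]): Double =
    (a intersect b).size.toDouble / (a union b).size
}
```

For example, two users who liked exactly the same items get cosine similarity 1.0 regardless of how many items they rated, which is the "adjusts for magnitude bias" property the slide mentions.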
10. Comparing Similarity
All-pairs Similarity (a.k.a. pair-wise similarity, similarity join)
Naïve implementation: O(m * n^2) shuffle; m = rows, n = cols
Clever approximations:
  Reduce m: sampling and bucketing
  Reduce n: sparse matrix; factor out frequent values (0?)
  Locality Sensitive Hashing
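To make the O(m * n^2) cost concrete, here is a sketch (not from the talk) of the naïve all-pairs similarity join over the columns of an m x n matrix: n*(n-1)/2 column pairs, each costing O(m) to compare.

```scala
// Naive all-pairs cosine similarity between the columns of a dense matrix.
// Purely illustrative single-machine code; the distributed version is what
// incurs the O(m * n^2) shuffle the slide warns about.
object AllPairs {
  def columnPairsSimilarity(rows: Seq[Seq[Double]]): Map[(Int, Int), Double] = {
    val n    = rows.head.length
    val cols = (0 until n).map(j => rows.map(_(j)))   // transpose into n columns of length m
    (for {
      i <- 0 until n
      j <- i + 1 until n                              // every unordered pair: n*(n-1)/2 of them
      dot  = cols(i).zip(cols(j)).map { case (x, y) => x * y }.sum
      norm = math.sqrt(cols(i).map(x => x * x).sum) *
             math.sqrt(cols(j).map(x => x * x).sum)
    } yield (i, j) -> (if (norm == 0) 0.0 else dot / norm)).toMap
  }
}
```

Sampling, sparsity, and LSH all attack the same problem: shrinking either the number of pairs or the cost per pair before this quadratic loop runs.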
11. Audience Participation Required!!
Instructions for you:
① Navigate to sparkafterdark.com
② Click on 3 actors & 3 actresses
-> You are here ->
github.com/fluxcapacitor/
hub.docker.com/r/fluxcapacitor/pipeline/
12. DataFrames
Inspired by R and pandas DataFrames
Cross-language support: SQL, Python, Scala, Java, R
Levels the performance of Python, Scala, Java, and R: generates JVM bytecode vs. serializing/pickling objects to Python
A DataFrame is a container for a logical plan: transformations are lazy and represented as a tree; the Catalyst Optimizer creates the physical plan
DataFrame.rdd returns the underlying RDD if needed
Custom UDFs via registerFunction(); new, experimental UDAF support
Use DataFrames instead of RDDs!!
13. Catalyst Optimizer
Converts the logical plan to a physical plan
Manipulates & optimizes the DataFrame transformation tree:
  Subquery elimination: use aliases to collapse subqueries
  Constant folding: replace an expression with a constant
  Simplify filters: remove unnecessary filters
  Predicate/filter pushdowns: avoid unnecessary data loads
  Projection collapsing: avoid unnecessary projections
Hooks for custom rules
  Rules are Scala case classes implementing o.a.s.sql.catalyst.rules.Rule
  val newPlan = MyFilterRule(analyzedPlan)
  Can be applied to any plan stage
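The "rules are case classes that rewrite a tree" idea can be sketched without Catalyst's real classes. The toy expression tree and the constant-folding rule below are made up for illustration; Catalyst's actual `Rule[LogicalPlan]` works on query plans the same way, via pattern matching:

```scala
// Toy expression tree plus a constant-folding rewrite, mimicking the shape
// of a Catalyst optimizer rule. Lit/Col/Add are invented for this sketch.
sealed trait Expr
case class Lit(v: Int)           extends Expr
case class Col(name: String)     extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object ConstantFolding {
  def apply(e: Expr): Expr = e match {
    case Add(l, r) =>
      (apply(l), apply(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)   // fold two literals into one constant
        case (fl, fr)         => Add(fl, fr)  // otherwise keep the (rewritten) node
      }
    case other => other                       // literals and columns are left alone
  }
}
```

Running it on `Add(Col("x"), Add(Lit(2), Lit(3)))` collapses the constant subtree to `Lit(5)` while leaving the column reference untouched, which is exactly the kind of tree manipulation the slide lists.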
14. Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
Requires explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan
15. Plan Visualization & Join/Aggregation Metrics (new in Spark 1.5!)
Effectiveness of filters
Cost-based optimization is applied
Peak memory for joins and aggregations
Optimized CPU-cache-aware binary format minimizes GC & improves join performance (Project Tungsten)
16. Data Sources API
Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface)
  ExplainCommand (impl: case class)
  CacheTableCommand (impl: case class)
Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class)
  TableScan (impl: returns all rows)
  PrunedFilteredScan (impl: column pruning and predicate pushdown)
  InsertableRelation (impl: insert or overwrite data using SaveMode)
Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class for all filter pushdowns for this data source)
  EqualTo, GreaterThan, StringStartsWith
17. Creating a Custom Data Source
Study existing native and third-party data source impls:
Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
  class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)
  class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation
18. Contributing a Custom Data Source
spark-packages.org
  Managed by
  Contains links to externally-managed GitHub projects
  Ratings and comments
  Spark version requirements of each package
Examples:
  https://github.com/databricks/spark-csv
  https://github.com/databricks/spark-avro
  https://github.com/databricks/spark-redshift
19. Partitions, Pruning, Pushdowns
20. Demo Dataset (from previous Spark After Dark talks)
RATINGS: UserID, ProfileID, Rating (1-10)
GENDERS: UserID, Gender (M, F, U) <-- Totally --> Anonymous
21. Partitions
Partition based on data usage patterns:
  /genders.parquet/gender=M/…
                  /gender=F/…  <-- Use case: access users by gender
                  /gender=U/…
Partition Discovery: on read, infer partitions from the organization of the data (i.e. gender=F)
Dynamic Partitions: upon insert, dynamically create partitions; specify the field to use for each partition (i.e. gender)
  SQL: INSERT TABLE genders PARTITION (gender) SELECT …
  DF:  gendersDF.write.format("parquet").partitionBy("gender").save(…)
22. Pruning
Partition Pruning: filter out entire partitions of rows on partitioned data
  SELECT id, gender FROM genders WHERE gender = 'U'
Column Pruning: filter out entire columns for all rows if they are not required; extremely useful for columnar storage formats (Parquet, ORC)
  SELECT id, gender FROM genders
23. Pushdowns
Predicate (a.k.a. Filter) Pushdowns
  A predicate returns {true, false} for a given function/condition
  Filters rows as deep into the data source as possible
  The data source must implement PrunedFilteredScan
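A minimal sketch of the idea, with invented types (this is not Spark's `PrunedFilteredScan` API): a source that accepts a predicate can filter rows before handing them back, instead of the engine scanning everything and filtering afterward.

```scala
// ToySource and Row are hypothetical; they only illustrate where the
// filtering happens, which is the whole point of a pushdown.
case class Row(id: Int, gender: String)

class ToySource(data: Seq[Row]) {
  // Without pushdown: a full scan, every row crosses the source boundary
  // and must be filtered later by the query engine.
  def scanAll(): Seq[Row] = data

  // With pushdown: the predicate is evaluated inside the source, so only
  // matching rows are ever materialized and returned.
  def scanWithFilter(pred: Row => Boolean): Seq[Row] = data.filter(pred)
}
```

In real Spark, `PrunedFilteredScan.buildScan(requiredColumns, filters)` plays the role of `scanWithFilter`: the engine hands the source the column list and filter predicates, and the source returns only what survives them.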
24. Cassandra Pushdown Rules
Determines which filter predicates can be pushed down to Cassandra (from spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala):
  1. Only push down non-partition-key column predicates with =, >, <, >=, <= predicates.
  2. Only push down primary key column predicates with = or IN predicates.
  3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
  4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
  5. For cluster column predicates, only the last predicate can be a non-EQ predicate (including IN), and the preceding column predicates must be EQ predicates. If there is only one cluster column predicate, it can be any non-IN predicate.
  6. There are no pushdown predicates if there is any OR condition or NOT IN condition.
  7. We're not allowed to push down multiple predicates for the same column if any of them is an equality or IN predicate.
25. Native Spark SQL Data Sources
26. Spark SQL Native Data Sources - Source Code
27. JSON Data Source
DataFrame:
  val ratingsDF = sqlContext.read.format("json")
    .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or, using the convenience method --
  val ratingsDF = sqlContext.read.json(
    "file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL:
  CREATE TABLE genders USING json
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
28. JDBC Data Source
Add the driver to the Spark JVM system classpath:
  $ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame:
  val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
                       "url" -> "jdbc:postgresql:hostname:port/database",
                       "dbtable" -> "schema.tablename")
  val df = sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL:
  CREATE TABLE genders USING jdbc OPTIONS (url, dbtable, driver, …)
29. Parquet Data Source
Configuration:
  spark.sql.parquet.filterPushdown=true
  spark.sql.parquet.mergeSchema=true
  spark.sql.parquet.cacheMetadata=true
  spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames:
  val gendersDF = sqlContext.read.format("parquet")
    .load("file:/root/pipeline/datasets/dating/genders.parquet")
  gendersDF.write.format("parquet").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL:
  CREATE TABLE genders USING parquet
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
30. ORC Data Source
Configuration:
  spark.sql.orc.filterPushdown=true
DataFrames:
  val gendersDF = sqlContext.read.format("orc")
    .load("file:/root/pipeline/datasets/dating/genders")
  gendersDF.write.format("orc").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders")
SQL:
  CREATE TABLE genders USING orc
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
31. Third-Party Data Sources (spark-packages.org)
32. CSV Data Source (Databricks)
GitHub: https://github.com/databricks/spark-csv
Maven: com.databricks:spark-csv_2.10:1.2.0
Code:
  val gendersCsvDF = sqlContext.read
    .format("com.databricks.spark.csv")
    .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
    .toDF("id", "gender")  // toDF() defines column names
33. Avro Data Source (Databricks)
GitHub: https://github.com/databricks/spark-avro
Maven: com.databricks:spark-avro_2.10:2.0.1
Code:
  val df = sqlContext.read
    .format("com.databricks.spark.avro")
    .load("file:/root/pipeline/datasets/dating/gender.avro")
34. Redshift Data Source (Databricks)
GitHub: https://github.com/databricks/spark-redshift
Maven: com.databricks:spark-redshift:0.5.0
Code:
  val df: DataFrame = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
    .option("query", "select x, count(*) from my_table group by x")
    .option("tempdir", "s3n://tmpdir")
    .load()
Copies to S3 for fast, parallel reads vs. the single Redshift Master bottleneck
35. ElasticSearch Data Source (Elastic.co)
GitHub: https://github.com/elastic/elasticsearch-hadoop
Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code:
  val esConfig = Map("pushdown" -> "true",
                     "es.nodes" -> "<hostname>",
                     "es.port" -> "<port>")
  df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
    .options(esConfig).save("<index>/<document>")
36. Cassandra Data Source (DataStax)
GitHub: https://github.com/datastax/spark-cassandra-connector
Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code:
  ratingsDF.write.format("org.apache.spark.sql.cassandra")
    .mode(SaveMode.Append)
    .options(Map("keyspace" -> "dating", "table" -> "ratings"))
    .save()
37. REST Data Source (Databricks)
Coming soon! https://github.com/databricks/spark-rest?
Michael Armbrust, Spark SQL Lead @ Databricks
38. DynamoDB Data Source (IBM Spark Tech Center)
Coming soon! https://github.com/cfregly/spark-dynamodb
Me Erlich
39. Spark SQL Performance Tuning (o.a.s.sql.SQLConf)
spark.sql.inMemoryColumnarStorage.compressed=true
  Automatically selects a column codec based on the data
spark.sql.inMemoryColumnarStorage.batchSize
  Increase as much as possible without OOM; improves compression and GC
spark.sql.inMemoryPartitionPruning=true
  Enables partition pruning for in-memory partitions
spark.sql.tungsten.enabled=true
  Code gen for CPU and memory optimizations (Tungsten, a.k.a. Unsafe Mode)
spark.sql.shuffle.partitions
  Increase from the default of 200 for large joins and aggregations
spark.sql.autoBroadcastJoinThreshold
  Increase to tune this cost-based physical-plan optimization
spark.sql.hive.metastorePartitionPruning
  Predicate pushdown into the metastore to prune partitions early
spark.sql.planner.sortMergeJoin
  Prefer sort-merge join (vs. hash join) for large joins
spark.sql.sources.partitionDiscovery.enabled & spark.sql.sources.parallelPartitionDiscovery.threshold
40. Freg-a-palooza Upcoming World Tour
① New York Strata (Sept 29th – Oct 1st)
② London Spark Meetup (Oct 12th)
③ Scotland Data Science Meetup (Oct 13th)
④ Dublin Spark Meetup (Oct 15th)
⑤ Barcelona Spark Meetup (Oct 20th)
⑥ Madrid Spark Meetup (Oct 22nd)
⑦ Amsterdam Spark Summit (Oct 27th – Oct 29th)
⑧ Delft Dutch Data Science Meetup (Oct 29th)
⑨ Brussels Spark Meetup (Oct 30th)
⑩ Zurich Big Data Developers Meetup (Nov 2nd)
High probability I'll end up in jail
41. IBM Spark Tech Center is Hiring!
Only Fun, Collaborative People - No Erlichs!
Sign up for our newsletter at …
Thank You!
Chris Fregly, @cfregly
