Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

1,516 views

http://www.meetup.com/Spark-Barcelona/events/225868339/

http://www.meetup.com/Advanced-Apache-Spark-Meetup/events/225815100/

Published in: Software

  1. Spark After Dark: Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations
     Barcelona Spark Meetup, Oct 20th, 2015
     Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center
     ** We’re Hiring!! Nice People Only, Please. **
  2. Who Am I?
     spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
     - Streaming Data Engineer, Netflix Open Source Committer
     - Data Solutions Engineer, Apache Contributor
     - Principal Data Solutions Engineer, IBM Spark Technology Center
     - Meetup Organizer, Advanced Apache Spark Meetup
     - Book Author, Advanced (2016)
  3. Advanced Apache Spark Meetup
     - Total Spark Experts: ~1350+ in 3 mos!
     - #4 most active Spark Meetup in the world!
     Main Goals
     - Dig deep into the Spark & extended-Spark codebase
     - Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
     - Surface and share the patterns and idioms of these well-designed, distributed, big data components
  4. What is Spark?
     - Core
     - Spark Streaming: real-time
     - Spark SQL: structured data
     - MLlib: machine learning
     - GraphX: graph analytics
     - …
     - BlinkDB: approx queries
  5. Spark Deployments In Production
  6. Tools of the Talk
     - Redis
     - Docker
     - Cassandra
     - MLlib, GraphX
     - Parquet, JSON
     - Apache Zeppelin
     - Spark Streaming, Kafka
     - Spark SQL, DataFrames
     - Spark JDBC/ODBC Hive ThriftServer
     - ElasticSearch, Logstash, Kibana (ELK)
     and…
  7. SMACK Stack!
     - Spark (Data Processing)
     - Mesos (Cluster Manager)
     - Akka (Actors)
     - Cassandra (NoSQL)
     - Kafka (Streaming)
  8. Themes of this Talk
     - Parallelism
     - Performance
     - Streaming
     - Approximations
     - Similarity Measures
     - Recommendations
     and…
  9. Goals of Spark After Dark
     - Generate high-quality recommendations
     - Demonstrate Spark high-level libraries
       - Spark Streaming -> Kafka, Approximations
       - Spark SQL -> DataFrames, Cassandra
       - GraphX -> PageRank, Shortest Path
       - MLlib -> Matrix Factorization, Word2Vec
  10. Popular Dating Sites
  11. Parallelism
  12. My First Experience With Parallelism
     Brady Bunch circa 1980, Season 5, Episode 18: “Two Petes in a Pod”
  13. Parallel Algorithm: O(log n)
  14. Non-Parallel Algorithm: O(n)
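The contrast between slides 13 and 14 can be sketched with a plain sum. This is an illustration only, not Spark code: `sequential_reduce_steps` and `tree_reduce_rounds` are hypothetical helper names, and the sketch assumes enough workers that every pair within a round can be combined at the same time.

```python
def sequential_reduce_steps(values):
    """Sum values one after another; returns (total, number of dependent steps)."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1  # each addition waits for the previous one: O(n)
    return total, steps


def tree_reduce_rounds(values):
    """Sum values pairwise; returns (total, number of rounds).

    All pairs within a round are independent, so with enough workers
    each round costs one step and the depth grows as O(log n).
    """
    level, rounds = list(values), 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over to the next round
        level, rounds = nxt, rounds + 1
    return level[0], rounds


data = list(range(1, 1025))  # 1024 values
seq_total, seq_steps = sequential_reduce_steps(data)
par_total, par_rounds = tree_reduce_rounds(data)
```

Both functions return the same total; only the number of dependent steps differs (one per element versus one per halving of the input).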
  15. Spark is Parallel!
  16. Performance
  17. Spark Beats Hadoop @ 100 TB GraySort
     - On-disk only
     - 28,000 partitions
     - No in-memory caching
     [Comparison table image; column year labels: 2013 and 2014]
  18. Improved Shuffle and Network Layer
     - “Sort-based shuffle”
     - Minimize OS resources
     - Switched to async Netty
     - Keep CPUs hot
     - Reuse byte buffers to minimize GC
     - Use epoll for I/O to stay in kernel space
  19. Project Tungsten: CPU and Memory
     - More JVM bytecode generation, JIT optimize
     - CPU-cache-aware data structs and algos
     - Custom memory management
     [Figure panels: Serializers, Performance, New HashMap]
  20. DataFrames and Catalyst Optimizer
     Please Use DataFrames!
     [Figure annotation: JVM bytecode generation]
     https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
  21. Columnar Storage Format
     Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
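The min-max skipping idea on slide 21 can be sketched as follows. This is a simplified model, not Parquet's actual reader: `build_chunks` and `scan_equal` are hypothetical names, and the chunks are plain dictionaries built from already-sorted data, which is the case the slide restricts itself to.

```python
def build_chunks(sorted_values, chunk_size):
    """Split sorted values into chunks, storing min/max metadata per chunk."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        vals = sorted_values[i:i + chunk_size]
        # Because the input is sorted, the first and last values are min and max.
        chunks.append({"min": vals[0], "max": vals[-1], "values": vals})
    return chunks


def scan_equal(chunks, target):
    """Return (matches, chunks_read) for the predicate value == target."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if target < chunk["min"] or target > chunk["max"]:
            continue  # skipped using metadata only; the values are never touched
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v == target)
    return matches, chunks_read


chunks = build_chunks(list(range(0, 1000)), chunk_size=100)
matches, chunks_read = scan_equal(chunks, 742)
```

With sorted data the min-max ranges of the chunks do not overlap, so a point lookup reads a single chunk. On unsorted data the ranges overlap and far fewer chunks can be skipped.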
  22. Parquet File Format
     - Based on Google Dremel
     - Implemented by Twitter and Cloudera
     - Columnar storage format
     - Optimized for fast columnar aggregations
     - Tight compression
     - Supports pushdowns
     - Nested, self-describing, evolving schema
  23. Types of Compression
     - Run Length Encoding: Repeated data
     - Dictionary Encoding: Fixed set of values
     - Delta, Prefix Encoding: Sorted data
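The three encodings on slide 23 can each be sketched in a few lines. These are toy versions for intuition, not the encoders Parquet ships; the function names and sample data are made up for the example, and prefix encoding is left out.

```python
from itertools import groupby


def run_length_encode(values):
    """Repeated data: store (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def dictionary_encode(values):
    """Fixed set of values: store each distinct value once, plus small integer codes."""
    dictionary = sorted(set(values))
    code = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code[v] for v in values]


def delta_encode(sorted_values):
    """Sorted data: store the first value, then the differences between neighbors."""
    if not sorted_values:
        return []
    return [sorted_values[0]] + [b - a for a, b in zip(sorted_values, sorted_values[1:])]


rle = run_length_encode(["ES", "ES", "ES", "US", "US", "ES"])
dictionary, codes = dictionary_encode(["red", "blue", "red", "green", "blue"])
deltas = delta_encode([1000, 1003, 1004, 1010])
```

Each encoding pays off only on the kind of data the slide pairs it with: long runs, a small set of distinct values, or sorted values with small gaps.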
  24. Types of Query Optimizations
     - Column, Partition Pruning
     - Row, Predicate Pushdown
     SELECT b FROM table WHERE a IN (a2, a3)
  25. Streaming
  26. Direct Kafka Streaming: KafkaRDD
     - No single Receiver, no Write Ahead Log (WAL)
     - Workers pull from Kafka in parallel
     - Each KafkaRDD partition stores relevant offsets
     - Upon Worker Node failure, rebuild from offsets
     - Optimizes happy path by avoiding the WAL
     At least once delivery guarantee
  27. Approximations
  28. Count Min Sketch
     - Approximate counters
     - Better than HashMap
     - Low, fixed memory
     - Known error bounds
     - Large num of counters
     - From Twitter’s Algebird
     - Streaming example in Spark codebase
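A Count-Min Sketch as described on slide 28 can be sketched in plain Python. This is a minimal stand-in, not Algebird's implementation: the `CountMinSketch` class, its default `width` and `depth`, and the SHA-256 row hashing are all assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Approximate counters in a fixed depth x width table."""

    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, derived from a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum over the rows is
        # the tightest estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._columns(item))


cms = CountMinSketch()
for word in ["spark"] * 50 + ["kafka"] * 20 + ["mesos"] * 5:
    cms.add(word)
```

Memory stays at `width * depth` counters no matter how many distinct items arrive, which is the "low, fixed memory" property on the slide. The price is that estimates can overshoot when items collide.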
  29. HyperLogLog
     - Approximate cardinality (approx count distinct)
     - From Twitter’s Algebird
     - Low memory: 1.5KB @ 2% error, 10^9 elements
     - Streaming example in Spark codebase
     - RDD: countApproxDistinctByKey()
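The approximate distinct count on slide 29 can be sketched with a small HyperLogLog. This is a textbook-style version under stated assumptions, not Algebird's or Spark's implementation: the `HyperLogLog` class, the choice of `p=10` (1024 registers) and the SHA-256-derived 64-bit hash are illustrative, and the sketch does not try to reproduce the memory and error figures quoted on the slide.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with 2**p registers (use p >= 7)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the registers
estimate = hll.count()
```

With 1024 registers the expected relative error is roughly 1.04 / sqrt(1024), about 3%. More registers trade memory for accuracy.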
  30. Monte Carlo Simulations
     - From Manhattan Project (A-bomb)
     - Simulate movement of neutrons
     - Law of Large Numbers (LLN): the average of the results of many trials converges on the expected value
     - SparkPi example in Spark codebase
     - Pi ~ 4 * (# red dots / # total dots)
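The Pi estimate on slide 30 can be reproduced without a cluster. This is a single-process sketch of the same idea, not the SparkPi source: `estimate_pi` is a hypothetical name, the seed is fixed only to make runs repeatable, and "red dots" is read as the points that land inside the quarter circle.

```python
import random


def estimate_pi(num_trials, seed=42):
    """Throw random dots at the unit square and count those inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1  # a "red dot"
    # Quarter-circle area / square area = pi / 4, so scale the ratio by 4.
    return 4.0 * inside / num_trials


# More trials pull the estimate toward pi (Law of Large Numbers).
estimates = {n: estimate_pi(n) for n in (100, 10_000, 200_000)}
```

Every trial is independent, so the loop splits cleanly across workers that each count their own dots and sum the counts at the end.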
  31. Recommendations
  32. Interactive Demo!
  33. Audience Participation Needed!
     - Navigate to sparkafterdark.com
     - Click 3 actresses and 3 actors
     [Figure annotation: You are here]
  34. Types of Recommendations
     Non-personalized
     - Cold Start: No preference or behavior data for user, yet
     Personalized
     - User-Item Similarity: Items that others with similar prefs have liked
     - Item-Item Similarity: Items similar to your previously-liked items
  35. Non-Personalized Recommendations
  36. Summary Statistics and Aggregations
     - Top Users by Like Count
       “I might like users with the highest sum aggregation of likes overall.”
     SparkSQL + DataFrame = Aggregations
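The "top users by like count" aggregation on slide 36 amounts to a group-by and count. The sketch below uses plain Python rather than Spark SQL or DataFrames, and the `likes` pairs and the `top_liked` helper are made-up sample data and naming, not the demo's dataset.

```python
from collections import Counter

# Hypothetical (from_user, to_user) like events.
likes = [
    ("kimberly", "ali"), ("leslie", "ali"), ("holden", "ali"),
    ("kimberly", "matei"), ("holden", "matei"),
    ("holden", "reynold"),
]


def top_liked(like_events, k):
    """Count likes received per user and return the k most-liked users."""
    received = Counter(to_user for _, to_user in like_events)
    return received.most_common(k)


top_two = top_liked(likes, 2)
```

This ranking is non-personalized: every visitor sees the same list, which is why it fits the cold-start case where nothing is known about the user yet.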
  37. Graph Analytics
     - Top Influencers by Like Graph
       “I might like users who have the highest probability of me liking them randomly while walking the like graph.”
     GraphX: PageRank
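The random-walk intuition on slide 37 is what PageRank computes. The following is a textbook power-iteration sketch with normalized ranks, not necessarily how GraphX implements it: the `pagerank` function, the `damping` and `iterations` defaults, and the small like graph are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return {node: rank} for a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets the "random jump" share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing likes spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical like graph: an edge (a, b) means "a likes b".
like_graph = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
ranks = pagerank(like_graph)
top_influencer = max(ranks, key=ranks.get)
```

Unlike the plain like count, a like from a highly ranked user is worth more here, which is why `a` (liked only by the popular `c`) ends up well ahead of `b` and `d`.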
  38. Demo! Spark SQL/DataFrames + GraphX/PageRank
  39. Similarities
  40. Types of Similarity
     - Euclidean: linear measure; magnitude bias
     - Cosine: angle measure; adjusts for magnitude bias
     - Jaccard: (intersection / union); popularity bias
     - Log Likelihood: adjusts for popularity bias
     [Table: like matrix with columns Ali, Matei, Reynold, Patrick, Andy and rows Kimberly (4 likes), Leslie (2), Meredith (3), Lisa (3), Holden (5)]
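The first three measures on slide 40 can be sketched directly; log likelihood is left out. The vectors below are made up to expose the magnitude bias the slide mentions and are not taken from the slide's table; the function names are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Linear measure: grows with differences in magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Angle measure: ignores magnitude, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Set measure: |intersection| / |union| of the non-zero positions."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


light = [1, 1, 0, 0, 0]
heavy = [3, 3, 0, 0, 0]   # same direction as light, three times the magnitude
wider = [1, 1, 1, 0, 0]   # one extra like compared to light

dist = euclidean_distance(light, heavy)
cos = cosine_similarity(light, heavy)
jac = jaccard_similarity(light, wider)
```

`light` and `heavy` point the same way, so cosine similarity treats them as identical while Euclidean distance still separates them.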
  41. All-Pairs Similarity Comparison
     - Compare everything to everything
     - aka “pair-wise similarity” or “similarity join”
     - Naïve shuffle: O(m*n^2); m=rows, n=cols
     - Minimize shuffle through approximations!
     - Reduce m (rows): sampling and bucketing
     - Reduce n (cols): remove most frequent value (i.e. 0), Principal Component Analysis
  42. Reduce m: DIMSUM Sampling
     - “Dimension Independent Matrix Square Using MR”
     - Remove rows with low similarity probability
     - MLlib: RowMatrix.columnSimilarities(…)
     - Twitter: 40% efficiency gain over Cosine Similarity
  43. Reduce m: LSH Bucketing
     - “Locality Sensitive Hashing”
     - Split m into b buckets
     - Use similarity hash algorithm
     - Requires pre-processing of data
     - Compare bucket contents in parallel
     - Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
     - i.e. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50
     github.com/mrsqueeze/spark-hash
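The numbers on slide 43 can be checked with integer arithmetic. The snippet below only evaluates the slide's two expressions for the 500k x 500k example; it assumes the bucketed expression is read left to right, which makes it equal to m*n*b.

```python
m = n = 500_000   # rows and columns of the example matrix
b = 50            # number of LSH buckets

naive = m * n ** 2                 # O(m * n^2)
bucketed = m * n // b * b ** 2     # O(m * n / b * b^2), evaluated left to right
speedup = naive // bucketed        # simplifies to n / b
```

Under that reading the bucketed cost is 1.25e13 against 1.25e17 for the naive shuffle, a factor of n / b = 10,000.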
  44. Reduce n: Remove Most Frequent Value
     - Eliminate most-frequent value
     - Represent other values with (index,value) pairs
     - Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
     - Note: Choose most frequent value (may not be 0)
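The (index,value) representation on slide 44 can be sketched as a pair of helpers. `to_sparse` and `to_dense` are hypothetical names, and the sketch assumes a non-empty row with a single clear most-frequent value.

```python
from collections import Counter


def to_sparse(dense):
    """Return (most_frequent_value, [(index, value), ...]) for all other entries."""
    most_frequent = Counter(dense).most_common(1)[0][0]
    pairs = [(i, v) for i, v in enumerate(dense) if v != most_frequent]
    return most_frequent, pairs


def to_dense(most_frequent, pairs, length):
    """Rebuild the original row from the sparse form."""
    dense = [most_frequent] * length
    for i, v in pairs:
        dense[i] = v
    return dense


row = [0, 0, 5, 0, 0, 0, 2, 0]
fill, pairs = to_sparse(row)

# The most frequent value need not be 0.
other_fill, other_pairs = to_sparse([7, 7, 7, 1])
```

Any pairwise computation then only has to visit the stored pairs, which is where the drop from n to nnz in the slide's cost expression comes from.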
  45. Personalized Recommendations
  46. Recommendation Terminology
     - User: User seeking recommendations
     - Item: Item that has been liked or rated
     - Feedback
       - Explicit: like, rating
       - Implicit: search, click, hover, view, scroll
     - Feature Engineering: Dimension reduction
  47. Collaborative Filtering Personalized Recs
     - Like behavior of similar users
       “I like the same people that you like. What other people did you like that I haven’t seen?”
     MLlib: Matrix Factorization, User-Item Similarity
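The quote on slide 47 describes neighborhood-style collaborative filtering, which can be sketched without any library. Note the swap: this is a simple user-to-user Jaccard approach, not MLlib's matrix factorization, and the `recommend` function and the sample likes are invented for the example.

```python
def recommend(likes, user, k=3):
    """Suggest items liked by users whose likes overlap with `user`'s.

    likes: dict mapping each user to the set of items they liked.
    """
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        union = mine | theirs
        similarity = len(mine & theirs) / len(union) if union else 0.0
        if similarity == 0.0:
            continue  # nothing in common, so their likes carry no signal
        for item in theirs - mine:  # only items the user has not seen yet
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


sample_likes = {
    "me": {"A", "B"},
    "u1": {"A", "B", "C"},   # shares both of my likes
    "u2": {"A", "D"},        # shares one of my likes
    "u3": {"E"},             # shares nothing
}
suggestions = recommend(sample_likes, "me")
```

Comparing every user with every other user is the all-pairs cost discussed on the similarity slides, which is one reason to move to matrix factorization at scale.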
  48. Demo! Spark SQL/DataFrames + MLlib/Alternating Least Squares
  49. Text-based Personalized Recs (1/3)
     - Similar profiles to me
       “Our profiles have similar, unique k-skip n-grams. We might like each other.”
     MLlib: Word2Vec, TF/IDF, Doc Similarity
  50. Text-based Personalized Recs (2/3)
     - Similar profiles from my past likes
       “Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
     MLlib: Word2Vec, TF/IDF, Doc Similarity
  51. Text-based Personalized Recs (3/3)
     - Relevant, High-Value Emails
       “Your initial email has similar named entities to my profile. I might like you just for making the effort.”
     MLlib: Word2Vec, TF/IDF, Entity Recognition
     [Figure labels: Her Email, My Profile]
  52. The Future of Recommendations!
  53. Facial Recognition
     - Eigenfaces
       “Your face looks similar to others that I’ve liked. I might like you.”
     MLlib: RowMatrix, PCA, Item-Item Similarity
     Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  54. Natural Language Processing: Convo Bot
     - NLP and DecisionTrees
       “If your responses to my trite opening lines are positive, I may read your profile.”
     MLlib: TF/IDF, DecisionTree, Sentiment Analysis
     [Figure labels: Positive, Negative]
  55. Maintaining the Spark!
  56. Recommendations for Couples
     - Pathways of Similarity
       “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
     MLlib: RowMatrix, Item-Item Similarity
     GraphX: Nearest Neighbors, Shortest Path
     [Figure labels: similar plots, similar actors]
  57. Final Recommendation!
  58. Get Off the Computer & Meet People!
     Thank you!!
     Chris Fregly, @cfregly, IBM Spark Tech Center, San Francisco, CA, USA
     Relevant Links
     - advancedspark.com: Sign up for the book and meetup!
     - github.com/fluxcapacitor/pipeline: Clone all code used today!
     - hub.docker.com/r/fluxcapacitor/pipeline: Run all demos presented today!
     Image courtesy of http://www.duchess-france.org/
  59. Power of data. Simplicity of design. Speed of innovation. IBM Spark