MongoDB & Spark


Spark is a powerful framework for distributed processing of massive datasets. With an interactive shell, machine learning libraries, and in-memory data structures, Spark provides a tool set for high performance advanced analytics. Connecting Spark with MongoDB enables you to achieve sophisticated back-end analytics in combination with the performance of MongoDB. We'll take a look at how these two systems integrate with one another through sample code and demonstrations.

Presentation from Bryan Reinero, Developer Advocate at MongoDB.

Published in: Technology
  • @JacekLaskowski So glad to hear you enjoyed the presentation! You can locate a recording of this presentation here: https://youtu.be/MPPwn1XmhzQ. The video begins with MongoDB CTO and co-founder Eliot Horowitz presenting on aggregation under the hood and discussing patterns and anti-patterns of $lookup. Eliot's presentation is followed by Bryan Reinero's presentation on MongoDB and Spark. We hope you find this video helpful!
  • I must admit that the way Hadoop and Spark worlds are presented in the slides is simply AWESOME. The "distributed" part of both is so well laid out. I really wish I could watch the video from the presentation! Anyone?

MongoDB & Spark

  1. MongoDB + Spark @blimpyacht
  2. Level Setting
  3. TROUGH OF DISILLUSIONMENT
  4. HDFS Distributed Data
  5. HDFS YARN Distributed Resources
  6. HDFS YARN MapReduce Distributed Processing
  7. HDFS YARN Hive Pig Domain Specific Languages MapReduce
  8. Interactive Shell Easy (-er) Caching
  9. HDFS Distributed Data
  10. HDFS YARN Distributed Resources
  11. HDFS YARN Spark Hadoop Distributed Processing
  12. HDFS YARN Spark Hadoop Mesos
  13. HDFS Stand Alone YARN Spark Hadoop Mesos
  14. HDFS Stand Alone YARN Spark Hadoop Mesos Hive Pig
  15. HDFS Stand Alone YARN Spark Hadoop Mesos Hive Pig Spark Shell
  16. HDFS Stand Alone YARN Spark Hadoop Mesos Hive Pig Spark Shell Spark Streaming
  17. HDFS Stand Alone YARN Spark Hadoop Mesos Hive Pig Spark SQL Spark Shell Spark Streaming
  18. HDFS Stand Alone YARN Spark Hadoop Mesos Hive Pig Spark SQL Spark Shell Spark Streaming
  19. HDFS Stand Alone YARN Spark Hadoop Mesos Hive Pig Spark SQL Spark Shell Spark Streaming
  20. Stand Alone YARN Spark Hadoop Mesos Hive Pig Spark SQL Spark Shell Spark Streaming
  21. Spark Streaming Hive Spark Shell Mesos Hadoop Pig Spark SQL Spark Stand Alone YARN
  22. Stand Alone YARN Spark Mesos Spark SQL Spark Shell Spark Streaming
  23. Stand Alone YARN Spark Mesos Spark SQL Spark Shell Spark Streaming
  24. executor Worker Node executor Worker Node Driver Resilient Distributed Datasets
  25. Parallelization Parallelize = x
  26. Transformations Parallelize = x t(x) = x’ t(x’) = x’’
  27. Transformations filter( func ) union( otherRDD ) intersection( otherRDD ) distinct( n ) map( function )
  28. Action Parallelize = x t(x) = x’ t(x’) = x’’ f(x’’) = y
  29. Actions collect() count() first() take( n ) reduce( function )
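Slides 25–29 describe the split between lazy transformations (t(x) = x’, t(x’) = x’’) and eager actions (f(x’’) = y). The same lazy/eager split can be sketched with plain JDK streams, where intermediate operations play the role of transformations and a terminal operation plays the role of an action. This is an analogy, not Spark code; the counter exists only to make visible when evaluation actually happens.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipeline {
    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();

        // "Parallelize = x": start from a local collection.
        List<Integer> data = List.of(1, 2, 3, 4, 5);

        // Transformations (map, filter) are lazy: nothing runs yet.
        Stream<Integer> pipeline = data.stream()
            .map(n -> { evaluations.incrementAndGet(); return n * 10; }) // t(x) = x'
            .filter(n -> n > 20);                                        // t(x') = x''

        System.out.println("after transformations: " + evaluations.get()); // prints 0

        // The action (collect) forces the whole pipeline to run: f(x'') = y.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result);                 // prints [30, 40, 50]
        System.out.println("after action: " + evaluations.get()); // prints 5
    }
}
```

As in Spark, no work is done while the pipeline is being described; only the terminal operation triggers a pass over the data.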
  30. Lineage Parallelize = x t(x) = x’ t(x’) = x’’ f(x’’) = y
  31. Parallelize Transform Transform Action Lineage
  32. Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Lineage
  33. Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Lineage
  34. Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Parallelize Transform Transform Action Lineage
  35. https://github.com/mongodb/mongo-hadoop
  36. { "_id" : ObjectId("4f16fc97d1e2d32371003e27"), "body" : "the scrimmage is still up in the air. "subFolder" : "notes_inbox", "mailbox" : "bass-e", "filename" : "450.", "headers" : { "X-cc" : "", "From" : "michael.simmons@enron.com", "Subject" : "Re: Plays and other information", "X-Folder" : "Eric_Bass_Dec2000Notes FoldersNotes inbox", "Content-Transfer-Encoding" : "7bit", "X-bcc" : "", "To" : "eric.bass@enron.com", "X-Origin" : "Bass-E", "X-FileName" : "ebass.nsf", "X-From" : "Michael Simmons", "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)", "X-To" : "Eric Bass", "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } }
  37. { "_id" : ObjectId("4f16fc97d1e2d32371003e27"), "body" : "the scrimmage is still up in the air. "subFolder" : "notes_inbox", "mailbox" : "bass-e", "filename" : "450.", "headers" : { "X-cc" : "", "From" : "michael.simmons@enron.com", "Subject" : "Re: Plays and other information", "X-Folder" : "Eric_Bass_Dec2000Notes FoldersNotes inbox", "Content-Transfer-Encoding" : "7bit", "X-bcc" : "", "To" : "eric.bass@enron.com", "X-Origin" : "Bass-E", "X-FileName" : "ebass.nsf", "X-From" : "Michael Simmons", "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)", "X-To" : "Eric Bass", "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } }
  38. { _id : "gretchen.hardeway@enron.com|shirley.crenshaw@enron.com", value : 2 } { _id : "kmccomb@austin-mccomb.com|brian@enron.com", value : 2 } { _id : "sally.beck@enron.com|sandy.stone@enron.com", value : 2 }
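Slide 38 shows output documents keyed by a composite "sender|recipient" string with a per-pair message count as the value. The grouping behind that output can be sketched in plain Java, with a hypothetical Mail record standing in for the Enron email documents (the names and data here are illustrative, not from the connector):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PairCounts {
    // Hypothetical stand-in for one email's From/To header fields.
    record Mail(String from, String to) {}

    public static void main(String[] args) {
        List<Mail> mails = List.of(
            new Mail("sally.beck@enron.com", "sandy.stone@enron.com"),
            new Mail("sally.beck@enron.com", "sandy.stone@enron.com"),
            new Mail("kmccomb@austin-mccomb.com", "brian@enron.com"));

        // Key each message by "sender|recipient" and count per key --
        // the same shape as the { _id, value } documents on the slide.
        Map<String, Long> counts = mails.stream().collect(
            Collectors.groupingBy(m -> m.from() + "|" + m.to(),
                                  Collectors.counting()));

        System.out.println(counts);
    }
}
```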
  39. Eratosthenes Democritus Hypatia Shemp Euripides
  40. Spark Configuration Configuration conf = new Configuration(); conf.set( "mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat" ); conf.set( "mongo.input.uri", "mongodb://localhost:27017/db.collection" );
  41. Spark Context JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD( conf, MongoInputFormat.class, Object.class, BSONObject.class );
  42. Spark Context JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD( conf, MongoInputFormat.class, Object.class, BSONObject.class );
  43. Spark Context JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD( conf, MongoInputFormat.class, Object.class, BSONObject.class );
  44. Spark Context JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD( conf, MongoInputFormat.class, Object.class, BSONObject.class );
  45. Spark Context JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD( conf, MongoInputFormat.class, Object.class, BSONObject.class );
  46. mongos mongos Data Services
  47. Deployment Artifacts Hadoop Connector Jar Fat Jar Java Driver Jar
  48. Spark Submit /usr/local/spark-1.5.1/bin/spark-submit --class com.mongodb.spark.examples.DataframeExample --master local Examples-1.0-SNAPSHOT.jar
  49. Stand Alone YARN Spark Mesos Spark SQL Spark Shell Spark Streaming
  50. JavaRDD<Message> messages = documents.map( new Function<Tuple2<Object, BSONObject>, Message>() { public Message call(Tuple2<Object, BSONObject> tuple) { BSONObject header = (BSONObject)tuple._2.get("headers"); Message m = new Message(); m.setTo( (String) header.get("To") ); m.setX_From( (String) header.get("From") ); m.setMessage_ID( (String) header.get( "Message-ID" ) ); m.setBody( (String) tuple._2.get( "body" ) ); return m; } } );
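Slide 50 converts each (Object, BSONObject) tuple into a Message bean by pulling fields out of the nested "headers" document. The same field extraction can be sketched JDK-only, with a nested Map standing in for the BSONObject and a hypothetical plain Message class in place of the slide's bean:

```java
import java.util.Map;

public class MessageMapping {
    // Hypothetical plain-Java stand-in for the Message bean on the slide.
    static class Message {
        String to, from, messageId, body;
    }

    // Same extraction the slide's map() function performs, with a
    // nested Map standing in for the BSONObject document.
    static Message toMessage(Map<String, Object> doc) {
        @SuppressWarnings("unchecked")
        Map<String, Object> headers = (Map<String, Object>) doc.get("headers");
        Message m = new Message();
        m.to = (String) headers.get("To");
        m.from = (String) headers.get("From");
        m.messageId = (String) headers.get("Message-ID");
        m.body = (String) doc.get("body");
        return m;
    }

    public static void main(String[] args) {
        // Fields taken from the sample document on slide 36.
        Map<String, Object> doc = Map.of(
            "body", "the scrimmage is still up in the air.",
            "headers", Map.of(
                "To", "eric.bass@enron.com",
                "From", "michael.simmons@enron.com",
                "Message-ID", "<6884142.1075854677416.JavaMail.evans@thyme>"));

        Message m = toMessage(doc);
        System.out.println(m.from + " -> " + m.to);
    }
}
```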
  51. MongoDB & Spark code demo
  52. THE FUTURE AND BEYOND THE INFINITE
  53. Stand Alone YARN Spark Mesos Spark SQL Spark Shell Spark Streaming
  54. MongoDB + Spark
  55. THANKS! { name: 'Bryan Reinero', role: 'Developer Advocate', twitter: '@blimpyacht', email: 'bryan@mongodb.com' }
