Ruby on Big Data @ Philly Ruby Group
Presentation on Virgil, accessing Cassandra, Hadoop and Ruby via REST.

© All Rights Reserved

Ruby on Big Data @ Philly Ruby Group Presentation Transcript

  • 1. Ruby on Big Data
    Brian O’Neill, Lead Architect, Health Market Science (HMS)
    email: bone@alumni.brown.edu
    blog: http://brianoneill.blogspot.com/
    The views expressed herein are my own and do not necessarily reflect the views of HMS or other organizations mentioned.
  • 2. Agenda
    Big Data Orientation: Cassandra, Hadoop, SOLR, Storm
    DEMO
    Java/Ruby Interoperability
    Advanced Ideas: Rails Integration; Combining Real-time w/ Batch Processing (The Final Frontier)
  • 3. “Big” Data
    Size doesn’t always matter; it may be what you’re doing with it, e.g. Natural-Language Processing.
    Flexibility was our major motivator: data sources with disparate schemas.
  • 4. Decomposing the Problem
    Data: Storage, Indexing, Querying
    Processing: Distributed, Batch, Real-time
  • 5. NoSQL Storage
    BASE: Basic Availability, Soft-state, Eventual consistency
    Simple API: REST + JSON
  • 6. Heritage
  • 7. Cassandra’s Data Model
    Keyspaces
      Column Families
        Rows (Sorted by KEY!)
          Columns (Name : Value)
  • 8. Example
    BeerGuys (Keyspace)
      Users (Column Family)
        bonedog (Row)
          firstName : Brian
          lastName : O’Neill
        lisa (Row)
          firstName : Lisa
          lastName : O’Neill
          maidenName : Kelley
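The nesting on the slide above can be pictured as nested hashes. This is a minimal sketch in plain Ruby, with no Cassandra client involved; the variable names are hypothetical, and the key ordering simply follows the slide's note that rows are sorted by key.

```ruby
# Keyspace -> column family -> row key -> { column name => value }.
# Plain Ruby hashes stand in for Cassandra; this only illustrates the shape.
beer_guys = {
  "Users" => {
    "lisa"    => { "firstName" => "Lisa",  "lastName" => "O'Neill", "maidenName" => "Kelley" },
    "bonedog" => { "firstName" => "Brian", "lastName" => "O'Neill" }
  }
}

users = beer_guys["Users"]

# Rows in the same column family need not share columns:
# only the "lisa" row carries a maidenName column.
has_maiden_name = users["bonedog"].key?("maidenName")
puts has_maiden_name
puts users["lisa"]["maidenName"]

# The slide notes that rows are sorted by key, so iterate in key order.
sorted_keys = users.keys.sort
sorted_keys.each { |key| puts "#{key}: #{users[key]['firstName']}" }
```

The point of the sketch is the flexible schema: each row is just a map of column names to values.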
  • 9. Cassandra Architecture
    Ring Architecture
      Hash(key) -> Node
      e.g. md5(“Brian”) = “S”
      Written to Node A
    [Diagram: Client writing to a ring of nodes A (N-Z), F (A-F), M (G-M)]
  • 10. Replication
    Fault Tolerance
      Written to next N nodes in ring.
      Can be made datacenter aware.
    e.g. md5(“Brian”) -> “S”
      Written to Nodes A and L
    [Diagram: Client writing to a ring of nodes A (S-Z), F (A-F), L (G-L), M (L-M), S (N-S)]
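The two slides above (hash the key to find a node, then write to the next N nodes in the ring) can be sketched roughly as follows. This is an illustration only, not Cassandra's actual partitioner: it assumes each node owns an equal slice of the MD5 token space, and the method names and the node list are hypothetical.

```ruby
require "digest/md5"

TOKEN_BITS = 128 # MD5 digests are 128 bits wide

# Map a row key to the index of the node that owns its token.
# Assumption: the token space is split into equal slices, one per node.
def owner_index(key, ring)
  token = Digest::MD5.hexdigest(key).to_i(16)
  token * ring.size / (2**TOKEN_BITS)
end

# The owning node plus the following nodes around the ring hold the replicas.
def replicas_for(key, replication_factor, ring)
  first = owner_index(key, ring)
  (0...replication_factor).map { |i| ring[(first + i) % ring.size] }
end

ring = %w[A F L M S] # node names borrowed from the slide's diagram
puts replicas_for("Brian", 2, ring).inspect
```

Because the placement depends only on the key and the ring layout, any client can compute where a row lives without asking a central coordinator.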
  • 11. Consistency Levels (READ & WRITE)
    ONE: 1st Response
    QUORUM: N / 2 + 1 Replicas
    ALL: All Replicas
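As a small worked example of the levels above, the following sketch computes how many replicas must respond for a given replication factor N. The method name and the symbols for the levels are hypothetical; only the arithmetic comes from the slide.

```ruby
# Number of replicas that must respond at each consistency level,
# given a replication factor n.
def required_replicas(level, n)
  case level
  when :one    then 1
  when :quorum then n / 2 + 1 # integer division, as on the slide
  when :all    then n
  else raise ArgumentError, "unknown consistency level: #{level}"
  end
end

[3, 5, 6].each do |n|
  puts "N=#{n}: ONE=#{required_replicas(:one, n)} " \
       "QUORUM=#{required_replicas(:quorum, n)} ALL=#{required_replicas(:all, n)}"
end
```

Since two quorums always add up to more than N replicas, a QUORUM write followed by a QUORUM read must overlap in at least one replica.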
  • 12. Time & Idempotency
    Order  Operation                    Time
    1      INSERT “Brian” into Users    1/10/2012 @11:15:00 EST
    2      DELETE from Users “Brian”    1/10/2012 @11:11:00 EST
    Every mutation is an insert!
    Latest timestamp wins.
    Buyer beware!
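The "latest timestamp wins" rule from the slide above can be sketched as below. This is a simplified model, not Cassandra's implementation: mutations are plain hashes, and `resolve` is a hypothetical helper that picks the winner for a single key.

```ruby
require "time"

# Resolve competing mutations for one key: the latest timestamp wins,
# regardless of the order in which the mutations arrived.
def resolve(mutations)
  mutations.max_by { |mutation| mutation[:timestamp] }
end

# The two operations from the slide, in arrival order (EST is UTC-5).
arrivals = [
  { op: :insert, key: "Brian", timestamp: Time.parse("2012-01-10 11:15:00 -0500") },
  { op: :delete, key: "Brian", timestamp: Time.parse("2012-01-10 11:11:00 -0500") }
]

winner = resolve(arrivals)
puts winner[:op] # insert: the delete arrived second but carries an older timestamp
```

This is why the slide warns the buyer: a delete issued by a client with a lagging clock can silently lose to an earlier insert.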
  • 13. Why NoSQL for us?
    Flexibility
    A new data processing paradigm
      Instead of bringing the data to the processing (in and out of a relational database),
      do this: bring the processing to the data.
  • 14. Batch Processing
    Distributable
    Scalable
    Data Locality
    [Diagram: DATA, JOB; ring nodes A (T-A), S (I-R), H (B-G); HDFS]
  • 15. Map / Reduce
    tuple = (key, value)
    map(x) -> tuple[]
    reduce(key, value[]) -> tuple[]
  • 16. Word Count
    The Code
      # emit is assumed to be provided by the map/reduce framework
      def map(doc)
        doc.split.each do |word|
          emit(word, 1)
        end
      end
      def reduce(key, values)
        sum = values.inject(0) { |sum, x| sum + x }
        emit(key, sum)
      end
    The Run
      doc1 = "boy meets girl"
      doc2 = "girl likes boy"
      map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
      map(doc2) -> (girl, 1), (likes, 1), (boy, 1)
      reduce(boy, [1, 1]) -> (boy, 2)
      reduce(girl, [1, 1]) -> (girl, 2)
      reduce(likes, [1]) -> (likes, 1)
      reduce(meets, [1]) -> (meets, 1)
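The run on the slide above can be reproduced in a single process. The sketch below is not Hadoop code: lambdas return their tuples instead of calling a framework-supplied `emit`, and the grouping step between the two phases, which a real framework performs, is written out by hand. All names are hypothetical.

```ruby
# Map: one document in, a list of (word, 1) tuples out.
map_fn = lambda do |doc|
  doc.split.map { |word| [word, 1] }
end

# Reduce: one key and all of its values in, a single (key, sum) tuple out.
reduce_fn = lambda do |key, values|
  [key, values.inject(0) { |sum, x| sum + x }]
end

docs = ["boy meets girl", "girl likes boy"]

# Shuffle: group the mapped values by key before reducing.
grouped = Hash.new { |hash, key| hash[key] = [] }
docs.flat_map { |doc| map_fn.call(doc) }.each { |word, one| grouped[word] << one }

# Reduce each group, in key order for stable output.
counts = grouped.sort.map { |word, ones| reduce_fn.call(word, ones) }
counts.each { |word, count| puts "#{word}: #{count}" }
```

Map and reduce never see each other's state, which is what lets a framework spread them across many machines.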
  • 17. Putting it Together
    [Diagram: ring nodes A (T-A), S (I-R), H (B-G); Storm]
  • 18. But...
    We love Ruby! and it’s all in Java. :(
    That’s okay, because
    We love REST!
  • 19. Why REST?
    [Diagram: Client, ???]
  • 20. Why Ruby?
    Java
      cassandra/examples/hadoop_word_count-> find . -name *.java
      ./src/WordCount.java
      ./src/WordCountCounters.java
      ./src/WordCountSetup.java
      cassandra/examples/hadoop_word_count-> wc -l
      495
    Ruby
      virgil-1.0.5.1-SNAPSHOT/example-> wc -l wordcount.rb
      22 wordcount.rb
      virgil-1.0.5.1-SNAPSHOT/example-> wc -l demo.sh
      22 demo.sh
    90% reduction in code!
  • 21. Virgil: REST Layer
    CRUD via HTTP
    Map/Reduce via HTTP
    [Diagram: Client; ring nodes A, S, H; Storm]
  • 22. DEMO
  • 23. Java Interoperability
    Conventional Interoperability
      I/O Streams between processes
      Hadoop Streaming
      Storm Multilang
  • 24. Better? Use JRuby
    Single Process
    Parse Once / Eval Many
    JSR 223 ScriptEngine
      // Look up the JRuby engine once, then evaluate scripts with per-call bindings.
      ENGINE = new ScriptEngineManager().getEngineByName("jruby");
      ScriptContext context = new SimpleScriptContext();
      Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
      bindings.put("variable", "value");
      ENGINE.eval(script, context);
    Redbridge
      // Parse the script once, then call methods on the resulting receiver many times.
      this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
      this.rubyReceiver = rubyContainer.runScriptlet(script);
      rubyContainer.callMethod(rubyReceiver, "foo", "value");
  • 25. CRUD via HTTP
    http://virgil/data/{keyspace}/{columnFamily}/{column}/{row}
      PUT : Replaces Content of Row/Column
      GET : Retrieves Value of a Row/Column
      DELETE : Removes Value of a Row/Column
    [Diagram: curl; ring nodes A, S, H]
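From Ruby, the three verbs above could be issued with the standard library's Net::HTTP. The sketch below only builds the requests and prints them, so it runs without a server. Everything specific in it is an assumption: `virgil` is the placeholder host from the slide, `cell_uri` is a hypothetical helper, the keyspace and column names are borrowed from the earlier example slide, and the plain-text request body is a guess at what the server accepts.

```ruby
require "net/http"
require "uri"

# Build the URL for one cell. The segment order follows the template on the slide.
def cell_uri(keyspace, column_family, column, row)
  URI("http://virgil/data/#{keyspace}/#{column_family}/#{column}/#{row}")
end

uri = cell_uri("BeerGuys", "Users", "firstName", "bonedog")

put_request = Net::HTTP::Put.new(uri)       # replaces the content
put_request.body = "Brian"                  # assumed payload format
get_request = Net::HTTP::Get.new(uri)       # retrieves the value
delete_request = Net::HTTP::Delete.new(uri) # removes the value

[put_request, get_request, delete_request].each do |request|
  puts "#{request.method} #{request.path}"
end

# Sending a request needs a reachable server, for example:
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(get_request) }
```

Because the interface is plain HTTP, the same calls work from curl, as the slide's diagram suggests, or from any other HTTP client.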
  • 26. Map/Reduce over HTTP
    wordcount.rb
      def map(rowKey, columns)
        result = []
        columns.each do |column_name, value|
          words = value.split
          words.each do |word|
            result << [word, "1"]
          end
        end
        return result
      end

      def reduce(key, values)
        rows = {}
        total = 0
        columns = {}
        values.each do |value|
          total += value.to_i
        end
        columns["count"] = total.to_s
        rows[key] = columns
        return rows
      end
    [Diagram: curl; ring nodes A, S, H; CF in, CF out]
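To try the wordcount.rb logic above without a cluster, a small local harness can feed it rows and merge the results. This is a sketch under assumptions: the map and reduce bodies are restated as lambdas, the input rows are made up, and the grouping and merging steps only approximate what the server does between "CF in" and "CF out".

```ruby
# The map logic from wordcount.rb: one row in, a list of [word, "1"] pairs out.
virgil_map = lambda do |_row_key, columns|
  columns.flat_map { |_name, value| value.split.map { |word| [word, "1"] } }
end

# The reduce logic from wordcount.rb: returns { row key => { column => value } }.
virgil_reduce = lambda do |key, values|
  total = values.inject(0) { |sum, value| sum + value.to_i }
  { key => { "count" => total.to_s } }
end

# Hypothetical input column family: row key => { column name => value }.
input_cf = {
  "1" => { "sentence" => "boy meets girl" },
  "2" => { "sentence" => "girl likes boy" }
}

# Group the map output by word.
grouped = Hash.new { |hash, key| hash[key] = [] }
input_cf.each do |row_key, columns|
  virgil_map.call(row_key, columns).each { |word, one| grouped[word] << one }
end

# Merge every reduce result into the output column family.
output_cf = {}
grouped.each { |word, ones| output_cf.merge!(virgil_reduce.call(word, ones)) }

output_cf.each { |row_key, columns| puts "#{row_key} => #{columns['count']}" }
```

Note that the reduce step returns rows and columns rather than a bare count, so its output can be written straight back into a column family.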
  • 27. Typhoeus
      require 'typhoeus'

      # `file` and `id` are not defined on the slide; they are assumed to be set elsewhere.
      $processed = $time_outs = $faults = $failures = 0
      hydra = Typhoeus::Hydra.new
      while (line = file.gets)
        body = "{ \"sentence\" : \"#{line.chomp}\" }"
        request = Typhoeus::Request.new("http://localhost:8080/virgil/data/dump/#{id}",
                                        :body => body,
                                        :method => :patch,
                                        :headers => {},
                                        :timeout => 5000,      # milliseconds
                                        :cache_timeout => 60,  # seconds
                                        :params => {})
        request.on_complete do |response|
          if response.success?
            $processed = $processed + 1
            if ($processed % 1000 == 0) then
              puts("Processed #{$processed} records.")
            end
          elsif response.timed_out?
            $time_outs = $time_outs + 1
          elsif response.code == 0
            $faults = $faults + 1
          else
            $failures = $failures + 1
          end
        end
        hydra.queue request
      end
      hydra.run
  • 28. Rails Integration?
    [Diagram: Load Balancer; Data Processing; ring nodes A, S, H]
    “REST is the new JDBC”
  • 29. Advanced Topics
    REAL-TIME PROCESSING
  • 30. Real-time Processing
    Deals with data streams
    Storm
    [Diagram: Spouts emit tuples to Bolts; Bolts emit tuples to further Bolts]
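The spout and bolt idea above can be illustrated with a toy pipeline. This is not the Storm API: a finite Enumerator stands in for an unbounded stream, lambdas stand in for bolts, and all names are hypothetical.

```ruby
# Spout: a source of tuples. A finite Enumerator stands in for an endless stream.
sentence_spout = ["boy meets girl", "girl likes boy"].each

# Bolt 1: split each sentence tuple into word tuples.
split_bolt = ->(sentence) { sentence.split }

# Bolt 2: keep a running count per word, updated as each tuple arrives.
counts = Hash.new(0)
count_bolt = ->(word) { counts[word] += 1 }

# Wire spout -> split bolt -> count bolt and process one tuple at a time.
sentence_spout.each do |sentence|
  split_bolt.call(sentence).each { |word| count_bolt.call(word) }
end

counts.each { |word, count| puts "#{word}: #{count}" }
```

Unlike the batch word count, the counts here are usable after every tuple rather than only when a job finishes.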
  • 31. Ratch Processing (Combining Real-time and Batch)
    Data Flows as:
      Cascading Map/Reduce jobs
      Storm Topologies?
    Can’t we have one framework to rule them all?
  • 32. Appendix
  • 33. Relational Storage
    ACID
      Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
      Consistent: A transaction cannot leave the database in an inconsistent state.
      Isolated: Transactions cannot interfere with each other.
      Durable: Completed transactions persist, even when servers restart etc.
  • 34. Relational Storage
    Benefits: Data Integrity, Ubiquity
    Limitations: Static Schemas, Scalability
  • 35. Indexing
    Real-time Answers
    Full-text queries
    Fuzzy Searching
    Nickname analysis
    Geospatial and Temporal Search
  • 36. Storage Options
  • 37. Queries / Flows
    Hive
    Pig
    Cascading
  • 38. Indexing Options