Ruby on Big Data @ Philly Ruby Group

Presentation on Virgil, accessing Cassandra, Hadoop and Ruby via REST.

Published in: Technology
  • Transcript

    • 1. Ruby on Big Data. Brian O’Neill, Lead Architect, Health Market Science (HMS). Email: bone@alumni.brown.edu. Blog: http://brianoneill.blogspot.com/. The views expressed herein are my own and do not necessarily reflect the views of HMS or other organizations mentioned.
    • 2. Agenda: Big Data Orientation (Cassandra, Hadoop, SOLR, Storm); DEMO; Java/Ruby Interoperability; Advanced Ideas (Rails Integration; Combining Real-time w/ Batch Processing (The Final Frontier))
    • 3. “Big” Data: Size doesn’t always matter; it may be what you’re doing with it (e.g. Natural-Language Processing). Flexibility was our major motivator: data sources with disparate schemas.
    • 4. Decomposing the Problem. Data: Storage, Indexing, Querying. Processing: Distributed, Batch, Real-time.
    • 5. NoSQL Storage. BASE: Basic Availability, Soft-state, Eventual consistency. Simple API: REST + JSON.
    • 6. Heritage
    • 7. Cassandra’s Data Model: Keyspaces, Column Families, Rows (Sorted by KEY!), Columns (Name : Value)
    • 8. Example
        BeerGuys (Keyspace)
          Users (Column Family)
            bonedog (Row): firstName : Brian, lastName : O’Neill
            lisa (Row): firstName : Lisa, lastName : O’Neill, maidenName : Kelley
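One way to picture the hierarchy on slides 7 and 8 is as nested hashes. The sketch below is only a mental model in plain Ruby, not a Cassandra client; the names `beer_guys` and `column_value` are made up for illustration.

```ruby
# Keyspace -> column family -> row key -> { column name => value }.
# Plain Ruby hashes standing in for Cassandra's structures (illustration only).
beer_guys = {
  "Users" => {
    "lisa"    => { "firstName" => "Lisa", "lastName" => "O'Neill", "maidenName" => "Kelley" },
    "bonedog" => { "firstName" => "Brian", "lastName" => "O'Neill" }
  }
}

# Hypothetical helper: look up one column value, or nil if any level is missing.
def column_value(keyspace, column_family, row, column)
  keyspace.dig(column_family, row, column)
end

# Rows are kept sorted by key, and rows need not share the same columns.
sorted_row_keys = beer_guys["Users"].keys.sort
puts sorted_row_keys.inspect
puts column_value(beer_guys, "Users", "lisa", "maidenName")
puts column_value(beer_guys, "Users", "bonedog", "maidenName").inspect
```

Note that `bonedog` simply has no `maidenName` column; nothing forces every row to carry the same set of columns, which is the flexibility the talk emphasizes.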
    • 9. Cassandra Architecture: Ring Architecture. Hash(key) -> Node, e.g. md5(“Brian”) = “S”, written to Node A. [Diagram: Client and a ring of nodes A (N-Z), F (A-F), M (G-M)]
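The Hash(key) -> Node lookup on slide 9 can be sketched as follows. This is a toy model under stated assumptions: the letter ranges are copied from the slide's diagram, while `token_for`, `node_for_token`, and `node_for` are hypothetical helpers, and squeezing an MD5 digest into one letter is a simplification (real tokens are large integers).

```ruby
require "digest"

# Token ranges copied from the slide's diagram: each node owns a range of letters.
RING = {
  "F" => ("A".."F"),
  "M" => ("G".."M"),
  "A" => ("N".."Z")
}.freeze

# Toy hash: reduce the MD5 digest of the key to a single letter A-Z.
def token_for(key)
  ("A".ord + Digest::MD5.hexdigest(key).to_i(16) % 26).chr
end

# Find the node whose range covers the token (nil if no range matches).
def node_for_token(token)
  node, _range = RING.find { |_node, range| range.cover?(token) }
  node
end

def node_for(key)
  node_for_token(token_for(key))
end

puts node_for_token("S")  # the slide's example token "S" falls in N-Z, owned by node A
puts node_for("Brian")    # whichever node the toy hash selects
```

The toy hash will not necessarily map "Brian" to "S" as the slide does; the slide's value is illustrative, and so is this one.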
    • 10. Replication: Fault Tolerance. Written to next N nodes in ring. Can be made datacenter aware. e.g. md5(“Brian”) -> “S”, written to Nodes A and L. [Diagram: Client and a ring of nodes A (S-Z), S (N-S), F (A-F), L (G-L), M (L-M)]
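A minimal sketch of "written to next N nodes in ring" from slide 10, assuming the replica set is the primary node plus the nodes that follow it clockwise, N in total. The node names are placeholders rather than the slide's labels, `replicas_for` is a hypothetical helper, and datacenter awareness is not modeled.

```ruby
# Return the replica set: the primary node plus the nodes that follow it
# clockwise, `replication_factor` nodes in total (never more than the ring holds).
def replicas_for(ring, primary, replication_factor)
  start = ring.index(primary)
  raise ArgumentError, "unknown node: #{primary}" if start.nil?

  count = [replication_factor, ring.size].min
  Array.new(count) { |offset| ring[(start + offset) % ring.size] }
end

ring = %w[node1 node2 node3 node4 node5] # placeholder names, in clockwise order
puts replicas_for(ring, "node2", 2).inspect
puts replicas_for(ring, "node5", 3).inspect # wraps around past the end of the array
```

The modulo on the index is what turns a flat array into a ring: the node after the last one is the first one again.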
    • 11. Consistency Levels (READ & WRITE): ONE = 1st Response; QUORUM = N / 2 + 1 Replicas; ALL = All Replicas
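The arithmetic behind slide 11 can be sketched like this. `quorum` and `required_acks` follow the slide's definitions; `overlapping?` is an addition not found in the slides, based on the general rule that a read is guaranteed to see the latest write when read acknowledgements plus write acknowledgements exceed the replication factor. All three names are hypothetical.

```ruby
# QUORUM as defined on the slide: N / 2 + 1 replicas, using integer division.
def quorum(replication_factor)
  replication_factor / 2 + 1
end

# Number of replica acknowledgements each consistency level waits for.
def required_acks(level, replication_factor)
  case level
  when :one    then 1
  when :quorum then quorum(replication_factor)
  when :all    then replication_factor
  else raise ArgumentError, "unknown consistency level: #{level.inspect}"
  end
end

# Not from the slides: reads and writes overlap on at least one replica
# when their combined acknowledgements exceed the replication factor.
def overlapping?(read_level, write_level, replication_factor)
  required_acks(read_level, replication_factor) +
    required_acks(write_level, replication_factor) > replication_factor
end

puts quorum(3)                         # 2
puts overlapping?(:quorum, :quorum, 3) # true
puts overlapping?(:one, :one, 3)       # false
```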
    • 12. Time & Idempotency
        Order | Operation | Time
        1 | INSERT “Brian” into Users | 1/10/2012 @11:15:00 EST
        2 | DELETE from Users “Brian” | 1/10/2012 @11:11:00 EST
        Every mutation is an insert! Latest timestamp wins. Buyer beware!
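The "latest timestamp wins" rule on slide 12 can be sketched as below, using the slide's two operations. `Mutation` and `apply_all` are hypothetical names, and the model ignores ties and per-column timestamps; it only shows why the DELETE that arrives second still loses.

```ruby
require "time"

# Every mutation, including a delete, is recorded with its timestamp.
Mutation = Struct.new(:op, :key, :timestamp)

# Apply mutations in arrival order; a mutation only takes effect if its
# timestamp is newer than the one already recorded for that key.
def apply_all(mutations)
  latest = {}
  mutations.each do |m|
    current = latest[m.key]
    latest[m.key] = m if current.nil? || m.timestamp > current.timestamp
  end
  latest
end

mutations = [
  Mutation.new(:insert, "Brian", Time.parse("2012-01-10 11:15:00 -0500")),
  Mutation.new(:delete, "Brian", Time.parse("2012-01-10 11:11:00 -0500"))
]

winner = apply_all(mutations)["Brian"]
puts winner.op # insert: the DELETE arrives later but carries an older timestamp
```

Because the outcome depends only on timestamps and not on arrival order, replaying the same mutations in any order gives the same result, which is the idempotency the slide title refers to.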
    • 13. Why NoSQL for us? Flexibility. A new data processing paradigm: instead of bringing the data to the processing (in and out of a relational database), do this: Processing -> Data (bring the processing to the data).
    • 14. Batch Processing: Distributable, Scalable, Data Locality. [Diagram: DATA and JOB, HDFS, and a ring of nodes A (T-A), S (I-R), H (B-G)]
    • 15. Map / Reduce: tuple = (key, value); map(x) -> tuple[]; reduce(key, value[]) -> tuple[]
    • 16. Word Count
        The Code:
            # emit is supplied by the Map/Reduce framework
            def map(doc)
              doc.split.each do |word|
                emit(word, 1)
              end
            end
            def reduce(key, values)
              sum = values.inject(0) { |total, x| total + x }
              emit(key, sum)
            end
        The Run:
            doc1 = "boy meets girl"
            doc2 = "girl likes boy"
            map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
            map(doc2) -> (girl, 1), (likes, 1), (boy, 1)
            reduce(boy, [1, 1]) -> (boy, 2)
            reduce(girl, [1, 1]) -> (girl, 2)
            reduce(likes, [1]) -> (likes, 1)
            reduce(meets, [1]) -> (meets, 1)
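The run on slide 16 can be reproduced locally with a small driver. This is a sketch, not Hadoop: instead of calling a framework-provided `emit`, the functions return their tuples, and `word_count` is a hypothetical stand-in for the framework that maps every document, groups tuples by key (the shuffle step), and reduces each group.

```ruby
# Map: one (word, 1) tuple per word in the document.
def map_doc(doc)
  doc.split.map { |word| [word, 1] }
end

# Reduce: sum the counts collected for one word.
def reduce_word(word, counts)
  [word, counts.sum]
end

# Hypothetical local driver standing in for the framework.
def word_count(docs)
  tuples = docs.flat_map { |doc| map_doc(doc) }
  grouped = tuples.group_by { |word, _count| word }
  grouped.map { |word, pairs| reduce_word(word, pairs.map(&:last)) }.to_h
end

counts = word_count(["boy meets girl", "girl likes boy"])
counts.each { |word, count| puts "#{word}: #{count}" }
```

The grouping step is the part a real framework does across machines; map and reduce themselves stay the same small functions.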
    • 17. Putting it Together. [Diagram: Storm and a ring of nodes A (T-A), S (I-R), H (B-G)]
    • 18. But... We love Ruby! And it’s all in Java. :( That’s okay, because we love REST!
    • 19. Why REST? [Diagram: Client, ???]
    • 20. Why Ruby?
        Java:
            cassandra/examples/hadoop_word_count-> find . -name *.java
            ./src/WordCount.java
            ./src/WordCountCounters.java
            ./src/WordCountSetup.java
            cassandra/examples/hadoop_word_count-> wc -l
            495
        Ruby:
            virgil-1.0.5.1-SNAPSHOT/example-> wc -l wordcount.rb
            22 wordcount.rb
            virgil-1.0.5.1-SNAPSHOT/example-> wc -l demo.sh
            22 demo.sh
        90% reduction in code!
    • 21. Virgil: REST Layer. CRUD via HTTP; Map/Reduce via HTTP. [Diagram: Client, a ring of nodes A, S, H, and Storm]
    • 22. DEMO
    • 23. Java Interoperability. Conventional Interoperability: I/O streams between processes (Hadoop Streaming, Storm Multilang).
    • 24. Better? Use JRuby: Single Process; Parse Once / Eval Many
        JSR 223 ScriptEngine:
            ENGINE = new ScriptEngineManager().getEngineByName("jruby");
            ScriptContext context = new SimpleScriptContext();
            Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
            bindings.put("variable", "value");
            ENGINE.eval(script, context);
        Redbridge:
            this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
            this.rubyReceiver = rubyContainer.runScriptlet(script);
            this.rubyContainer.callMethod(rubyReceiver, "foo", "value");
    • 25. CRUD via HTTP: http://virgil/data/{keyspace}/{columnFamily}/{column}/{row}. PUT : Replaces Content of Row/Column; GET : Retrieves Value of a Row/Column; DELETE : Removes Value of a Row/Column. [Diagram: curl and a ring of nodes A, S, H]
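The three verbs on slide 25 map naturally onto Ruby's standard `Net::HTTP` request classes. The sketch below only builds the requests; it assumes the URL template exactly as written on the slide (including its segment order), URL-safe names, and a raw string body for PUT, none of which is confirmed against Virgil itself. `virgil_uri`, `build_request`, and `send_request` are hypothetical helpers.

```ruby
require "net/http"
require "uri"

BASE_URL = "http://virgil/data" # host taken from the slide's URL template

# Fill in the slide's template: {keyspace}/{columnFamily}/{column}/{row}.
# Assumes every segment is already URL-safe.
def virgil_uri(keyspace, column_family, column, row)
  URI("#{BASE_URL}/#{keyspace}/#{column_family}/#{column}/#{row}")
end

# Build (but do not send) a request for one of the three verbs on the slide.
def build_request(verb, uri, body = nil)
  klass = { put: Net::HTTP::Put, get: Net::HTTP::Get, delete: Net::HTTP::Delete }.fetch(verb)
  request = klass.new(uri)
  request.body = body if body
  request
end

# Sending requires a reachable server, so it is defined here but not called.
def send_request(uri, request)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

uri = virgil_uri("BeerGuys", "Users", "firstName", "bonedog")
[build_request(:put, uri, "Brian"), build_request(:get, uri), build_request(:delete, uri)].each do |req|
  puts "#{req.method} #{req.path}"
end
```

Against a running server, `send_request(uri, build_request(:get, uri))` would return a `Net::HTTPResponse`; what its body looks like depends on Virgil and is not shown in the slides.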
    • 26. Map/Reduce over HTTP
        wordcount.rb:
            def map(rowKey, columns)
              result = []
              columns.each do |column_name, value|
                words = value.split
                words.each do |word|
                  result << [word, "1"]
                end
              end
              return result
            end
            def reduce(key, values)
              rows = {}
              total = 0
              columns = {}
              values.each do |value|
                total += value.to_i
              end
              columns["count"] = total.to_s
              rows[key] = columns
              return rows
            end
        [Diagram: curl and a ring of nodes A, S, H, with CF in and CF out]
    • 27. Typhoeus
            # file, id, and the $ counters are set up outside this excerpt
            hydra = Typhoeus::Hydra.new
            while (line = file.gets)
              body = "{ \"sentence\" : \"#{line}\" }"
              request = Typhoeus::Request.new("http://localhost:8080/virgil/data/dump/#{id}",
                                              :body => body,
                                              :method => :patch,
                                              :headers => {},
                                              :timeout => 5000, # milliseconds
                                              :cache_timeout => 60, # seconds
                                              :params => {})
              request.on_complete do |response|
                if response.success?
                  $processed = $processed + 1
                  if ($processed % 1000 == 0) then
                    puts("Processed #{$processed} records.")
                  end
                elsif response.timed_out?
                  $time_outs = $time_outs + 1
                elsif response.code == 0
                  $faults = $faults + 1
                else
                  $failures = $failures + 1
                end
              end
              hydra.queue request
            end
            hydra.run
    • 28. Rails Integration? “REST is the new JDBC”. [Diagram: Load Balancer, Data Processing, and a ring of nodes A, S, H]
    • 29. Advanced Topics: REAL-TIME PROCESSING
    • 30. Real-time Processing: Deals with data streams. [Diagram: a Storm topology in which Spouts and Bolts pass tuples on to other Bolts]
    • 31. Ratch Processing (Combining Real-time and Batch). Data Flows as: Cascading Map/Reduce jobs; Storm Topologies? Can’t we have one framework to rule them all?
    • 32. Appendix
    • 33. Relational Storage: ACID. Atomic: Everything in a transaction succeeds or the entire transaction is rolled back. Consistent: A transaction cannot leave the database in an inconsistent state. Isolated: Transactions cannot interfere with each other. Durable: Completed transactions persist, even when servers restart, etc.
    • 34. Relational Storage. Benefits: Data Integrity, Ubiquity. Limitations: Static Schemas, Scalability.
    • 35. Indexing: Real-time Answers; Full-text queries; Fuzzy Searching; Nickname analysis; Geospatial and Temporal Search
    • 36. Storage Options
    • 37. Queries / Flows: Hive, Pig, Cascading
    • 38. Indexing Options
