Ruby on Big Data (Cassandra + Hadoop)

Shows how to use Virgil to access the facilities of Hadoop and Cassandra from Ruby using REST.

Published in: Technology
  • Transcript of "Ruby on Big Data (Cassandra + Hadoop)"

    1. Ruby on Big Data. Brian O’Neill, Lead Architect, Health Market Science (HMS). The views expressed herein are my own and do not necessarily reflect the views of HMS or other organizations mentioned.
    2. Agenda: Big Data Orientation (Cassandra, Hadoop, SOLR, Storm); DEMO; Java/Ruby Interoperability; Advanced Ideas (Rails Integration; Combining Real-time w/ Batch Processing (The Final Frontier)).
    3. “Big” Data: Size doesn’t always matter; it may be what you’re doing with it, e.g. Natural-Language Processing. Flexibility was our major motivator: data sources with disparate schemas.
    4. Decomposing the Problem. Data: Storage, Indexing, Querying. Processing: Distributed, Batch, Real-time.
    5. Relational Storage: ACID.
        Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
        Consistent: A transaction cannot leave the database in an inconsistent state.
        Isolated: Transactions cannot interfere with each other.
        Durable: Completed transactions persist, even when servers restart etc.
    6. Relational Storage. Benefits: Data Integrity, Ubiquity. Limitations: Static Schemas, Scalability.
    7. NoSQL Storage. BASE: Basic Availability, Soft-state, Eventual consistency. Simple API: REST + JSON.
    8. Indexing: Real-time Answers; Full-text queries; Fuzzy Searching; Nickname analysis; Geospatial and Temporal Search.
    9. Storage Options
    10. Indexing Options
    11. Why? Cassandra: Consistency-level per operation; Temporal dimension of an operation; Idempotent mentality. SOLR: Community; Integration (Solandra); NOT scalability and flexibility (sharding stinks).
    12. Cassandra’s Data Model: Keyspaces; Column Families; Rows (Sorted by KEY!); Columns (Name : Value).
    13. Example
        BeerGuys (Keyspace)
          Users (Column Family)
            bonedog (Row)
              firstName : Brian
              lastName : O’Neill
            lisa (Row)
              firstName : Lisa
              lastName : O’Neill
              maidenName : Kelley
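The hierarchy in this example can be pictured as nested hashes. The sketch below is only an in-memory illustration, not a Cassandra client; the constant `BEER_GUYS` and the helpers `column_value` and `column_names` are hypothetical names introduced here.

```ruby
# Hypothetical in-memory picture of the slide's example; not a Cassandra client.
# Keyspace -> column family -> row key -> { column name => value }
BEER_GUYS = {
  "Users" => {
    "bonedog" => { "firstName" => "Brian", "lastName" => "O’Neill" },
    "lisa"    => { "firstName" => "Lisa", "lastName" => "O’Neill", "maidenName" => "Kelley" }
  }
}

# Look up one column value; returns nil when the row has no such column.
def column_value(keyspace, column_family, row_key, column_name)
  keyspace.dig(column_family, row_key, column_name)
end

# List the column names present in one row, sorted by name.
def column_names(keyspace, column_family, row_key)
  keyspace.fetch(column_family).fetch(row_key).keys.sort
end
```

Note that the two rows carry different columns: `lisa` has a `maidenName` and `bonedog` does not, which is the schema flexibility the deck refers to.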
    14. Cassandra Architecture: Ring Architecture; Hash(key) -> Node; Reliability; Scalability. Diagram: a Client and a ring of nodes A (N-Z), F (A-F), M (G-M).
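A minimal sketch of the Hash(key) -> Node idea, assuming each node owns the tokens up to and including its own and that keys are hashed with MD5. The `Ring` class, its method names, and the token values are illustrative assumptions; only the node names A, F and M come from the slide.

```ruby
require "digest"

# Hypothetical token ring; illustrates Hash(key) -> Node, not Cassandra's real code.
class Ring
  # node_tokens: { "node name" => Integer token }
  def initialize(node_tokens)
    @sorted = node_tokens.sort_by { |_name, token| token }
  end

  # Hash the key onto the token space (MD5 is an assumption for this sketch).
  def token_for(key)
    Digest::MD5.hexdigest(key).to_i(16)
  end

  # The owner is the first node whose token is >= the key's token;
  # if there is none, wrap around to the first node on the ring.
  def node_for(key)
    t = token_for(key)
    owner = @sorted.find { |_name, token| token >= t } || @sorted.first
    owner.first
  end
end

# Three evenly spaced tokens on a 128-bit ring (values chosen for illustration).
MAX_TOKEN = 2**128
RING = Ring.new("A" => MAX_TOKEN / 3, "F" => 2 * MAX_TOKEN / 3, "M" => MAX_TOKEN - 1)
```

Calling `RING.node_for("bonedog")` always returns the same node for the same key, which is what lets any client locate a row without a central directory.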
    15. Why NoSQL for us? Flexibility. A new data processing paradigm. Instead of: Data -> Processing. Do this: Processing -> Data.
    16. Batch Processing: Distributable; Scalable; Data Locality. Diagram labels: DATA, JOB; ring nodes A (T-A), S (I-R), H (B-G); HDFS.
    17. Map / Reduce
        tuple = (key, value)
        map(x) -> tuple[]
        reduce(key, value[]) -> tuple[]
    18. Word Count
        The Code
            # emit is assumed to be supplied by the map/reduce framework.
            def map(doc)
              doc.split.each do |word|
                emit(word, 1)
              end
            end

            def reduce(key, values)
              sum = values.inject(0) { |total, x| total + x }
              emit(key, sum)
            end
        The Run
            doc1 = "boy meets girl"
            doc2 = "girl likes boy"
            map (doc1) -> (boy, 1), (meets, 1), (girl, 1)
            map (doc2) -> (girl, 1), (likes, 1), (boy, 1)
            reduce (boy, [1, 1]) -> (boy, 2)
            reduce (girl, [1, 1]) -> (girl, 2)
            reduce (likes, [1]) -> (likes, 1)
            reduce (meets, [1]) -> (meets, 1)
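To try the word count without a cluster, the map, group and reduce phases can be run in a single process. This is a sketch under that assumption; `map_doc`, `reduce_counts` and `word_count` are hypothetical names, and the functions return pairs instead of calling a framework-provided `emit`.

```ruby
# Map: produce a (word, 1) pair for every word in the document.
def map_doc(doc)
  doc.split.map { |word| [word, 1] }
end

# Reduce: sum the counts collected for one word.
def reduce_counts(word, counts)
  [word, counts.sum]
end

# Hypothetical single-process driver: map every document, group the pairs
# by key (the "shuffle" step), then reduce each group.
def word_count(docs)
  grouped = docs.flat_map { |doc| map_doc(doc) }
                .group_by { |word, _count| word }
  grouped.map { |word, pairs| reduce_counts(word, pairs.map { |_w, count| count }) }.to_h
end
```

Running `word_count(["boy meets girl", "girl likes boy"])` reproduces the totals listed in "The Run" on the slide.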
    19. Queries / Flows: Hive; Pig; Cascading.
    20. Real-time Processing: Deals with data streams; Storm. Diagram: Spouts emit tuples to Bolts, and Bolts emit tuples to further Bolts.
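The spout and bolt vocabulary can be imitated in plain Ruby to get a feel for the data flow. This is not the Storm API: a spout is modelled as an Enumerator, a bolt as a lambda that takes one tuple and returns an array of tuples, and every name below is hypothetical.

```ruby
# Hypothetical stand-ins for Storm concepts; this is not the Storm API.

# A spout is modelled as an Enumerator that emits tuples one at a time.
def sentence_spout(sentences)
  sentences.each
end

# A bolt consumes one tuple and returns an array of zero or more tuples.
SPLIT_BOLT  = ->(sentence) { sentence.split }
UPCASE_BOLT = ->(word) { [word.upcase] }

# Push every tuple from the spout through the bolts in order.
# The stream is lazy, so tuples flow one at a time instead of in batches.
def run_topology(spout, bolts)
  bolts.reduce(spout.lazy) do |stream, bolt|
    stream.flat_map { |tuple| bolt.call(tuple) }
  end
end
```

For example, `run_topology(sentence_spout(["boy meets girl"]), [SPLIT_BOLT, UPCASE_BOLT]).to_a` yields the upcased words. A real topology differs in that it runs distributed and never terminates.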
    21. Putting it Together. Diagram: ring nodes A (T-A), S (I-R), H (B-G) together with Storm.
    22. But... We love Ruby! And it’s all in Java. :( That’s okay, because we love REST!
    23. REST Layer: CRUD via HTTP; Map/Reduce via HTTP. Diagram labels: Client; ring nodes A, S, H; Storm.
    24. DEMO
    25. Java Interoperability: Conventional Interoperability (I/O Streams between processes); Hadoop Streaming; Storm Multilang.
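Hadoop Streaming is an instance of the I/O-streams approach: the mapper and reducer are ordinary processes that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below assumes that line format; the function names and the `run_locally` helper are hypothetical, and `run_locally` only simulates `mapper | sort | reducer` in one process.

```ruby
require "stringio"

# Streaming-style mapper: read lines from `input`, write "word\t1" lines to `output`.
def stream_map(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Streaming-style reducer: lines arrive sorted by key, so equal keys are adjacent.
# Sum each run of equal keys and write "word\ttotal".
def stream_reduce(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    word, count = line.chomp.split("\t")
    if word != current
      output.puts "#{current}\t#{total}" unless current.nil?
      current = word
      total = 0
    end
    total += count.to_i
  end
  output.puts "#{current}\t#{total}" unless current.nil?
end

# Simulate `mapper | sort | reducer` in one process for a quick local check.
def run_locally(text)
  mapped = StringIO.new
  stream_map(StringIO.new(text), mapped)
  sorted = mapped.string.lines.sort.join
  reduced = StringIO.new
  stream_reduce(StringIO.new(sorted), reduced)
  reduced.string
end
```

In a real streaming job each function would live in its own script and be called as `stream_map($stdin, $stdout)` or `stream_reduce($stdin, $stdout)`.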
    26. CRUD via HTTP
        http://virgil/data/{keyspace}/{columnFamily}/{column}/{row}
        PUT : Replaces Content of Row/Column
        GET : Retrieves Value of a Row/Column
        DELETE : Removes Value of a Row/Column
        Diagram labels: curl; ring nodes A, S, H.
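From Ruby, the three verbs map directly onto `Net::HTTP` request classes. This sketch copies the host name `virgil` and the segment order from the URL pattern on the slide; the helper names are hypothetical, the path segments are assumed to be URL-safe, and the port and any headers a real deployment needs are not covered.

```ruby
require "net/http"
require "uri"

# Host and path prefix as written on the slide; adjust for a real deployment.
VIRGIL_BASE = "http://virgil/data"

# Build the resource URL, keeping the segment order shown on the slide.
# Segments are assumed to be URL-safe (no escaping is done here).
def virgil_uri(keyspace, column_family, column, row)
  URI.parse([VIRGIL_BASE, keyspace, column_family, column, row].join("/"))
end

# Build (but do not send) a request for one of the three verbs on the slide.
def virgil_request(verb, uri, body = nil)
  klass = { put: Net::HTTP::Put, get: Net::HTTP::Get, delete: Net::HTTP::Delete }.fetch(verb)
  request = klass.new(uri)
  request.body = body if body
  request
end

# Send a request built above; this needs a reachable server.
def virgil_send(uri, request)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

A write followed by a read would then look like `virgil_send(uri, virgil_request(:put, uri, "Brian"))` and `virgil_send(uri, virgil_request(:get, uri)).body`.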
    27. Map/Reduce over HTTP
        wordcount.rb
            def map(rowKey, columns)
              result = []
              columns.each do |column_name, value|
                words = value.split
                words.each do |word|
                  result << [word, "1"]
                end
              end
              return result
            end

            def reduce(key, values)
              rows = {}
              total = 0
              columns = {}
              values.each do |value|
                total += value.to_i
              end
              columns["count"] = total.to_s
              rows[key] = columns
              return rows
            end
        Diagram labels: curl; ring nodes A, S, H; CF in; CF out.
    28. Better? Use JRuby: Single Process; Parse Once / Eval Many.
        JSR 223
            ScriptEngine ENGINE = new ScriptEngineManager().getEngineByName("jruby");
            ScriptContext context = new SimpleScriptContext();
            Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
            bindings.put("variable", "value");
            ENGINE.eval(script, context);
        Redbridge
            this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
            this.rubyReceiver = rubyContainer.runScriptlet(script);
            rubyContainer.callMethod(rubyReceiver, "foo", "value");
    29. Rails Integration: “REST is the new JDBC”; ActiveRecord backed by REST? Anything more than a proxy? Diagram labels: Load Balancer; Data Processing; ring nodes A, S, H.
    30. Ratch Processing (Combining Real-time and Batch). Data Flows as: Cascading Map/Reduce jobs; Storm Topologies? Can’t we have one framework to rule them all?