Web-scale data processing: practical approaches for low-latency and batch



Published in: Technology
  • Web-scale data processing: practical approaches for low-latency and batch

    1. 1. $>whoami Edward Capriolo ● Developer @ dstillery (the company formerly known as m6d aka media6degrees) ● Hive: Project Management Committee ● Hadoop'in it since 0.17.2 ● Cassandra'in it since 0.6.X ● Hive'in it since 0.3.X ● Incredibly skilled with PowerPoint
    2. 2. Agenda for this talk ● Batch processing via Hadoop ● Stream processing ● Relational Databases and NoSQL ● Life lessons, quips, and other perspective
    3. 3. Before we talk tech... ● Let's talk math! ● Yay! Math fun! (as people start leaving the room) ● Don't worry. It is only a couple of slides. ● Wanted to talk about relational algebra since it is the foundation of relational databases ● Even in the NoSQL age, relational algebra is alive and well
    4. 4. Relational algebra... A big slide with many words ● Relational algebra received little attention outside of pure mathematics until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages. ● In computer science, relational algebra is an offshoot of first-order logic and of algebra of sets concerned with operations over finitary relations, usually made more convenient to work with by identifying the components of a tuple by a name (called attribute) rather than by a numeric column index, which is called a relation in database terminology. http://en.wikipedia.org/wiki/Relational_algebra
    5. 5. Operators of Relational algebra:
    6. 6. Projection ● SELECT Age, Weight ... Extended projections ● SELECT Age+Weight as X ... ● SELECT ROUND(Weight),Age+1 as X ...
    7. 7. Selection ● SELECT * FROM Person ● SELECT * FROM Person WHERE Age >=34 ● SELECT * FROM Person WHERE Age = Weight
    8. 8. Joins ● SELECT * FROM Car JOIN Boat on (CarPrice >= BoatPrice) ● SELECT * FROM Car JOIN Boat on (CarPrice = BoatPrice)
    9. 9. Aggregate ● SELECT sum(C) FROM r ● SELECT A, sum(C) FROM r GROUP BY A http://www.cbcb.umd.edu/confcour/CMSC424/Relational_algebra.pdf
    10. 10. Other Operators ● Set operations – Union – Intersection – Cartesian Product ● Outer joins – LEFT – RIGHT – FULL ● Semi Join / Exists
    11. 11. Batch Processing and Big Data ● When hadoop came on the scene it was a game changer because: – Viable implementation of Google's map reduce white paper – Worked with commodity hardware – Had no exorbitant software fees – Scaled processing and storage with growing companies, typically without needing processes to be redesigned
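A minimal sketch of how three of the operators above (selection, extended projection, aggregation) map onto ordinary collection code. The `Person` class, its fields, and the `RelationalOps` helper are hypothetical names chosen to mirror the SELECT examples on the slides; they are not part of the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical Person relation; field names mirror the SELECT examples.
final class Person {
    final String name;
    final int age;
    final int weight;

    Person(String name, int age, int weight) {
        this.name = name;
        this.age = age;
        this.weight = weight;
    }
}

final class RelationalOps {
    // Selection: SELECT * FROM Person WHERE Age >= minAge
    static List<Person> select(List<Person> people, int minAge) {
        return people.stream()
                .filter((Person p) -> p.age >= minAge)
                .collect(Collectors.toList());
    }

    // Extended projection: SELECT Age + Weight AS X FROM Person
    static List<Integer> project(List<Person> people) {
        return people.stream()
                .map((Person p) -> p.age + p.weight)
                .collect(Collectors.toList());
    }

    // Aggregation: SELECT Age, sum(Weight) FROM Person GROUP BY Age
    static Map<Integer, Integer> sumWeightByAge(List<Person> people) {
        return people.stream()
                .collect(Collectors.groupingBy((Person p) -> p.age, TreeMap::new,
                        Collectors.summingInt((Person p) -> p.weight)));
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("ann", 34, 120),
                new Person("bob", 29, 180),
                new Person("cat", 34, 150));
        System.out.println(select(people, 34).size());
        System.out.println(project(people));
        System.out.println(sumWeightByAge(people));
    }
}
```

Each method is a pure function from one relation to another, which is the property that lets an engine such as Hive split the same operators across many machines.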
    12. 12. Archetype Hadoop deployment (circa facebook 2009) ● Diagram components: Web Servers, Scribe MidTier, Scribe Writers, Realtime Hadoop Cluster, Hadoop Hive Warehouse, Oracle RAC, MySQL http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
    13. 13. The Hadoop archetype ● Component generating events (web servers) ● Component collecting logs into hadoop (scribe) ● Translation of raw data using hadoop and hive ● Output of rollups to oracle and other data systems – feedback loops (mysql <-> hive)
    14. 14. Use case: Book store ● Our book store will be named (say it with me!): – Web scale, – Big Data, – No SQL, – Real Time Analytics, – Books! ● One more time! – Web scale, Big Data, No SQL, Real Time Analytics, Books ● (A buzzword bingo company)
    15. 15. Domain model { "id":"00001", "refer":"http://affiliate1.superbooks.com", "ip":"", "status":"ACCEPTED", "eventTimeInMillis":1383011801439, "credit_hash":"ab45de21", "email":"bob@compuserv.com", "purchases":[ { "name":"Programming Hive", "cost":30.0 }, { "name":"frAgile Software Development", "cost":0.2 } ] }
    16. 16. Complex serialized payloads ● “Process web logs” in facebook's case did NOT always mean tab delimited text files ● In many cases scribe was logging complex structures in thrift format ● Hadoop (and hive) can work with complex records not typical in RDBMS
    17. 17. Log collection/ingestion http://flume.apache.org/FlumeUserGuide.html
    18. 18. Several ingestion approaches ● Scribe never took off ● Chukwa (hangs around, not sexy) ● Log servers log directly with the HDFS API ● Duct-taped set of shell scripts ● Flume seems to be the most widely used, feature rich, and supported system
    19. 19. Left up to the user... ● What format do you want the raw data in ● How should the data be staged in HDFS – hourly directories – by host ● How to monitor – Semantics of what the pipeline should do if files stop appearing? – Application specific sanity checks
    20. 20. Unleash the hounds!
    21. 21. Hive and relational algebra ● SELECT refer, sum(purchase.cost) <- Projection, Aggregation ● FROM store_transaction ● LATERAL VIEW explode(purchases) plist AS purchase <- Hive sexyness ● WHERE refer = 'y' <- Selection ● GROUP BY refer <- Aggregation
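The Hive query above can be read as three nested steps: filter rows by referrer, explode the nested purchase list into one row per purchase, then group and sum. A rough single-machine sketch in Java follows; the `Purchase`, `Transaction`, and `RevenueByReferrer` names are hypothetical and only loosely follow the domain model slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical record types loosely following the domain model slide.
final class Purchase {
    final String name;
    final double cost;

    Purchase(String name, double cost) {
        this.name = name;
        this.cost = cost;
    }
}

final class Transaction {
    final String refer;
    final List<Purchase> purchases;

    Transaction(String refer, List<Purchase> purchases) {
        this.refer = refer;
        this.purchases = purchases;
    }
}

final class RevenueByReferrer {
    static Map<String, Double> revenue(List<Transaction> txs, String wantedRefer) {
        Map<String, Double> totals = new TreeMap<String, Double>();
        for (Transaction t : txs) {
            if (!t.refer.equals(wantedRefer)) {
                continue;                                   // selection: WHERE refer = ...
            }
            for (Purchase p : t.purchases) {                // explode: one row per purchase
                totals.merge(t.refer, p.cost, Double::sum); // GROUP BY refer, sum(cost)
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Transaction> txs = Arrays.asList(
                new Transaction("y", Arrays.asList(
                        new Purchase("Programming Hive", 30.0),
                        new Purchase("frAgile Software Development", 0.2))),
                new Transaction("z", Arrays.asList(new Purchase("Other", 5.0))));
        System.out.println(revenue(txs, "y"));
    }
}
```

Hive runs the same logic in parallel: the filter and explode happen in the map phase, and the per-referrer sums are combined in the reduce phase.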
    22. 22. Hadoop/Hive's parallel implementation
    23. 23. Drawbacks of the batch approach ● Not efficient/possible on small time windows – Jobs have start up time and overhead ● Late data can be troublesome – Resulting in full rerun – Re-run of dependent jobs ● Failures can set processing back hours (or maybe days) ● Scheduling of dependent tasks – Not a huge consensus around the proper tool: Oozie, Azkaban, Cron ... pause not
    24. 24. More drawbacks of Batch data ● Interactive analysis of results ● Detecting sanity of input ● Result data typically moved into other systems for interactive analysis (post process) ● Most computational steps spill/persist to disk – Components of a job can be pipelined, but between two jobs is persistent storage that needs to be re-read for the next batch
    25. 25. Stream Processing
    26. 26. Stream processing ● My first “stream processing” job: reading in Associated Press data – Connecting to a terminal server connected to a serial modem – Writing this information to a database ● My definition: Processing data across one or more coordinated data channels ● Like “Big Data”, Stream Processing is: – Whatever you say it is
    27. 27. Common components of stream processing ● Message Queue – A system that delivers a never ending stream of data ● Processing engine – Manages streams and connects data to processing ● External/Internal persistence – Some data may live outside the stream. – It could be transient or persistent
    28. 28. Message Queues
    29. 29. Why most Message Queue software does not 'scale' ● MQ 'guarantees' – In order delivery – Acknowledgments ● MQs typically optimize by keeping all data in memory – Semantics around what happens when memory is full: Block, Persist to disk, Throw away ● Not trashing Message Queues here. Many of their guarantees are hard to deliver at scale, and not always needed
    30. 30. Kafka – A high-throughput distributed messaging system A publish-subscribe messaging re-thought as a distributed commit log
    31. 31. Distributed ● Data streams are partitioned and spread over a cluster of machines
    32. 32. Durable and fast ● Messages are always persisted to disk! ● Consumers track their position in log files ● Kafka uses the sendfile system call for performance
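To illustrate the "consumers track their position" idea, here is a toy in-memory sketch of an append-only log with independent consumer offsets. It is not the Kafka API: `CommitLog` and `LogConsumer` are hypothetical names, and a plain list stands in for Kafka's on-disk log files.

```java
import java.util.ArrayList;
import java.util.List;

// Toy append-only log. An in-memory list stands in for on-disk log segments.
final class CommitLog {
    private final List<String> messages = new ArrayList<String>();

    // Appends a message and returns its offset in the log.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns up to max messages starting at the given offset.
    List<String> read(long offset, int max) {
        int from = (int) Math.max(0, Math.min(offset, messages.size()));
        int to = Math.min(from + max, messages.size());
        return new ArrayList<String>(messages.subList(from, to));
    }
}

// Each consumer owns its position; the log keeps no per-consumer state.
final class LogConsumer {
    private final CommitLog log;
    private long position = 0;

    LogConsumer(CommitLog log) {
        this.log = log;
    }

    List<String> poll(int max) {
        List<String> batch = log.read(position, max);
        position += batch.size();
        return batch;
    }

    // Rewinding is just resetting the position, which makes replay cheap.
    void seek(long offset) {
        position = offset;
    }
}
```

Because the log never tracks who has read what, adding a consumer costs the log nothing, and a consumer can replay history by seeking backwards.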
    33. 33. Consumer Groups ● Multiple groups can subscribe to an event stream ● Producers can determine event partitioning
    34. 34. Great! You have streaming data. How do you process it? ● Storm - https://github.com/nathanmarz/storm ● Samza - samza.incubator.apache.org ● S4 - http://incubator.apache.org/s4/ ● http://www03.ibm.com/software/products/us/en/infospherestreams/ Heck even I wrote one! ● IronCount https://github.com/edwardcapriolo/IronCount
    35. 35. Before you have a holy war over this software decision...
    36. 36. Storm ● Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC.
    37. 37. Storm (Trident) API ● Data comes from spouts ● Spouts/streams produce tuples ● FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 1, new Values("line one"), new Values("line two")); https://github.com/nathanmarz/storm/wiki/Trident-tutorial
    38. 38. (extended) Projection ● Stream can be processed into another stream ● Here a line is split into words ● Stream words = stream.each(new Fields("sentence"), new Split(), new Fields("word")); ● (Similar to hive's LATERAL VIEW)
    39. 39. Grouping and Aggregation ● GroupedStream groupByWord = words.groupBy(new Fields("word")); ● TridentState groupByState = groupByWord.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
    40. 40. Great! We just did distributed stream processing! ● But where are the results? ● groupByWord.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")); ● In Memory... aka nowhere :) ● We can change that... ● But first some math/science/drivel I stole from wikipedia in an attempt to sound smart!
    41. 41. Temporal database ● A temporal database is a database with built-in support for handling data involving time, for example a temporal data model and a temporal version of Structured Query Language (SQL). ● Temporal databases are in contrast to current databases, which store only facts which are believed to be true at the current time
    42. 42. Batch/Hadoop was easy (temporally speaking) ● Input data is typically in write-once hdfs files* ● Output data typically goes to write-once output files* ● Reduce phase does not start until map/shuffle is done ● Output data typically not available until the entire job is done* ● Idempotent computation ● *Going to qualify everything with typically, because of computational idempotency
    43. 43. The real “real time” ● Real time is often misused ● Anecdotally people usually mean – Low latency – Small windows of time (sub-minute & sub-second) ● Our bookstore wants “real time” stats – aggregations and data stores updated incrementally as data is processed ● One way to implement this is discrete columns bucketed by time
    44. 44. Tempor-alizing data ● In an earlier example we aggregated revenue by referrer like this: ● SELECT refer, sum(purchase.cost) ... GROUP BY refer ● Now we include the time: ● SELECT date(eventtime), hour(eventtime), minute(eventtime), refer, sum(purchase.cost) ... GROUP BY date(eventtime), hour(eventtime), minute(eventtime), refer
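The same bucketing can be done incrementally as events arrive, rather than in a batch GROUP BY. The sketch below truncates each event timestamp to the start of its minute and keeps a running total per (referrer, minute). `MinuteBuckets` and its in-memory map are hypothetical stand-ins for whatever store holds the rollups.

```java
import java.util.Map;
import java.util.TreeMap;

final class MinuteBuckets {
    private static final long MINUTE_MILLIS = 60000L;

    // Running revenue totals keyed by "refer@minuteStartMillis".
    private final Map<String, Double> totals = new TreeMap<String, Double>();

    // Truncates an epoch-millisecond timestamp to the start of its minute.
    static long minuteBucket(long eventTimeInMillis) {
        return eventTimeInMillis - Math.floorMod(eventTimeInMillis, MINUTE_MILLIS);
    }

    // Adds one purchase to the bucket its event time falls into.
    void add(String refer, long eventTimeInMillis, double cost) {
        String key = refer + "@" + minuteBucket(eventTimeInMillis);
        totals.merge(key, cost, Double::sum);
    }

    // Returns the total for a referrer and minute, or 0.0 if nothing was recorded.
    double total(String refer, long minuteStartMillis) {
        Double value = totals.get(refer + "@" + minuteStartMillis);
        return value == null ? 0.0 : value;
    }
}
```

Late events are handled naturally here: an event with an old timestamp simply updates its old bucket, with no rerun of a whole job.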
    45. 45. Storing data in Cassandra ● Horizontally scalable (hundreds of nodes) ● No single point of failure ● Integrated replication ● Writes like lightning (Structured log storage) ● Reads like thunder (LevelDB & BigTable inspired storage)
    46. 46. Scalable time series made easy with cassandra ● Create a table with one row per day per refer, sorted by time ● CREATE TABLE purchase_by_refer ( refer text, dt date, event_time timestamp, tot counter, PRIMARY KEY ((refer,dt),event_time)); ● UPDATE purchase_by_refer SET tot=tot+1 WHERE refer='store1' AND dt='2013-01-12' AND event_time='2013-01-12 07:03:00';
    47. 47. If you want c* and storm ● https://github.com/hmsonline/storm-cassandra ● Uses Cassandra as a persistence model for storm ● Good documentation
    48. 48. The home stretch: Joining streams and caching data ● Some use cases of distributed streaming involve keeping local caches ● Streaming algorithms require memory of recent events and should not query a datastore each time an event is received ● Kafka is useful in this case because the user can dictate the partition the data is sent to
    49. 49. Streaming Recommendation System https://github.com/edwardcapriolo/IronCount
    50. 50. Input Streams ● Stream 1 (users): user|1:edward, user|2:nate, user|3:stacey ● Stream 2 (items): cart|1:saw:2.00, cart|1:hammer:3.00, cart|3:puppy:1.00 ● Both streams merged (union) ● The field after the pipe is the userid (projection) ● User id should be the partition key when sent on (aggregation)
    51. 51. Handle message and route by id public void handleMessage(MessageAndMetadata<Message> m) { String line = getMessage(m.message()); String[] parts = line.split("\\|"); /* split takes a regex, so the pipe is escaped */ String table = parts[0]; String row = parts[1]; String[] columns = row.split(":"); /* columns[0] is the user id, used as the partition key */ producer.send(new ProducerData<String, String>("reduce", columns[0], Arrays.asList(table+"|"+row))); }
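A standalone sketch of the parsing and routing step, without any Kafka classes. Note that `String.split` takes a regular expression, so the pipe has to be escaped. `StreamRouter` and the hash-modulo partitioning are assumptions for illustration; the partitioner actually used by a Kafka producer may differ.

```java
final class StreamRouter {
    // Returns the table name, the part before the pipe ("user" or "cart").
    static String table(String line) {
        return line.split("\\|", 2)[0];
    }

    // Returns the user id, the first colon-separated field after the pipe.
    static String userId(String line) {
        String row = line.split("\\|", 2)[1];
        return row.split(":")[0];
    }

    // Assumed partitioning scheme: hash of the key modulo the partition count.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String[] lines = {"user|1:edward", "cart|1:saw:2.00", "cart|3:puppy:1.00"};
        for (String line : lines) {
            System.out.println(table(line) + " -> user " + userId(line)
                    + " -> partition " + partitionFor(userId(line), 4));
        }
    }
}
```

Because both the `user` and `cart` records for the same user id hash to the same partition, a single downstream handler sees everything it needs to join them from local memory.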
    52. 52. Update in memory copy ● public class ReduceHandler implements MessageHandler { HashMap<User,ArrayList<Item>> data = new EvictingHashMap<User,ArrayList<Item>>(); ... public void handleMessage (MessageAndMetadata<Message> m) { if ( table.equals("cart")){ Item i = new Item(); i.parse(columns); incrementItemCounter(u); incrementDollarByUser(u,i); } suggestNewItemsForUser(u);
    53. 53. Challenges of streaming ● Replay of data could double/miss count ● New evolving API's – You may have to build support for your stack ● Distributed computation is harder to log/debug ● Monitoring consumption on topics to avoid falling behind ● Monitoring topics to notice if data stops
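One common way to guard against double counting on replay is to remember the highest offset already applied per partition and skip anything at or below it. The sketch below assumes every message carries a (partition, offset) pair and that offsets within a partition arrive in increasing order; `ReplaySafeCounter` is a hypothetical name, and in a real system the offsets and the count would need to be persisted together.

```java
import java.util.HashMap;
import java.util.Map;

final class ReplaySafeCounter {
    // Highest offset already applied, per partition (assumed in-order delivery).
    private final Map<Integer, Long> lastOffsetByPartition = new HashMap<Integer, Long>();
    private long count = 0;

    // Applies the message unless it was already counted; returns true if applied.
    boolean handle(int partition, long offset) {
        Long last = lastOffsetByPartition.get(partition);
        if (last != null && offset <= last) {
            return false; // replayed message, already counted
        }
        lastOffsetByPartition.put(partition, offset);
        count++;
        return true;
    }

    long count() {
        return count;
    }
}
```

This only prevents double counting; avoiding missed counts still depends on the consumer restarting from an offset no later than the last one it durably recorded.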
    54. 54. El fin