
Big Data DC - Analytics at Clearspring

9,745 views

A brief introduction to Clearspring's approach to Big Data analytics.

Published in: Business, Technology
  • If you are interested in a more detailed explanation of Clearspring's approach to big data processing you can read more about it on our blog page. http://clearspring.com/blog/2011/05/05/clearsprings-big-data-architecture-part-1/

  1. Data Processing at Clearspring
       • Matt Abrams, abramsm @{twitter,clearspring.com}
  2. Clearspring’s Data Source
       • 9 million domains
       • 1 billion users
       • 2.5 billion view events per day
       • 250-300 million searches per day
       • 3 million shares per day
       • 4-5 TB per day
       • 1.8 PB per year
  3. I Can Has Analytics?
       • Analytics cluster has hundreds of nodes
       • Distributed processing system, Hydra, capable of batch and live data processing
       • Small(ish) Cassandra cluster serves more than 200M share counter views daily (growing fast)
  4. Data Stack
       • Diagram layers: Products, Data Analysis/Query, Splitter, Tree Builder, DFS
  6. Task Breakdown
       • Diagram labels: m hosts, job, n tasks, task DB 0-n
  7. Design Philosophy
       • Speed over safety
       • Simplicity over complexity
       • At scale, small performance deltas matter
       • Close is good enough in many cases (probabilistic data structures)
  8. Speed over Safety
       • First rule of big data: you will not know what question you wanted to ask until after you have already processed the data
       • The system is designed to run much faster than real time. We expect to reprocess data, so speed provides safety
       • Paying the overhead cost for robustness slows processing down all of the time, even though failures are rare
       • Since we can process data faster than real time, it is cheaper to just reprocess when failures occur
  9. Simplicity over Complexity
       • Berkeley DB JE
           • Each task has a local in-process tree database
           • Task-local data is merged at query time
       • Distributed File System
           • We only need replication and distributed reads (not writes), so we built those two things and created a very simple DFS
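
The "merged at query time" idea on this slide can be illustrated with a minimal sketch. The per-task counters and the `query_total_counts` helper are hypothetical stand-ins, not Hydra's actual query path; the point is only that each task writes to its own local store and the combining step is deferred until someone asks a question.

```python
from collections import Counter

# Hypothetical task-local results: each task counts shares per URL
# for its own shard of the data and never coordinates with other tasks.
task_results = [
    Counter({"url1": 3, "url2": 1}),
    Counter({"url1": 2, "url3": 5}),
    Counter({"url2": 4}),
]

def query_total_counts(results):
    """Merge per-task counters at query time instead of at write time."""
    merged = Counter()
    for local in results:
        merged.update(local)  # adds counts key by key
    return merged

totals = query_total_counts(task_results)
```

Under this assumption, writes stay cheap and local, and the cost of merging is paid only for the data a query actually touches.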
  10. At Scale Small Performance Deltas Matter
       • 2.5 billion events at ±0.001 s each ≈ 29 days of compute time
       • Using binary data formats avoids HashMaps and String tokenization
       • Java String hashCode() is fast but non-trivial
       • Array lookups are faster than hash lookups
       • Pre-binding data into Bundles allows for index lookups
       • Arrays are stored as first-class objects rather than delimited strings (avoids string tokenization overhead)
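
The 29-day figure on this slide can be checked with a few lines of arithmetic. The event count comes from the data source slide; the constant names are chosen here for illustration.

```python
EVENTS_PER_DAY = 2_500_000_000   # view events per day, from the slides
DELTA_SECONDS = 0.001            # per-event cost difference of one millisecond
SECONDS_PER_DAY = 24 * 60 * 60

# Total extra (or saved) compute time for one day's worth of events.
extra_seconds = EVENTS_PER_DAY * DELTA_SECONDS
extra_days = extra_seconds / SECONDS_PER_DAY

print(f"{extra_days:.1f} days")  # prints "28.9 days"
```

A one-millisecond change per event therefore adds or removes roughly a month of single-threaded compute for each day of input.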
  11. Close is good enough: Probabilistic Data Structures
       • Bloom Filters
           • Probabilistic, space-efficient data structure used to test whether an element is a member of a set, with some false positives but no false negatives
       • Probabilistic Counting
           • Linear-Time Probabilistic Counting [1]
           • Adaptive Counting [2]
           • LogLog Counting [3]
       • [1] http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=CF778D0AC35C32ABF85E7DBD47D7697F?doi=10.1.1.82.1074&rep=rep1&type=pdf
       • [2] http://conferences.sigcomm.org/sigcomm/2005/paper-CaiPan.pdf
       • [3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.8821&rep=rep1&type=pdf
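
The two ideas on this slide can be sketched in a few lines each. This is a minimal illustration and not the Stream-Lib implementation: the class and function names, the bit-array sizes, and the use of SHA-256 as the hash are all assumptions made here. The Bloom filter may report false positives but never false negatives, and the linear counter estimates the number of distinct items from the fraction of bits left unset.

```python
import hashlib
import math

def _hash(item, seed, modulus):
    """Deterministic hash of a string into range(modulus), varied by seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def add(self, item):
        for seed in range(self.num_hashes):
            self.bits[_hash(item, seed, self.num_bits)] = 1

    def might_contain(self, item):
        # False means definitely absent; True means "probably present".
        return all(self.bits[_hash(item, seed, self.num_bits)]
                   for seed in range(self.num_hashes))

def linear_count(items, num_bits=4096):
    """Estimate distinct items from a bitmap: n ~= -m * ln(fraction of zero bits)."""
    bitmap = bytearray(num_bits)
    for item in items:
        bitmap[_hash(item, "lc", num_bits)] = 1
    zeros = num_bits - sum(bitmap)
    if zeros == 0:
        raise ValueError("bitmap is full; use more bits")
    return -num_bits * math.log(zeros / num_bits)

urls = [f"url{i}" for i in range(1000)]
bloom = BloomFilter()
for u in urls:
    bloom.add(u)

# Feeding every URL twice leaves the estimate unchanged: duplicates set the same bits.
estimate = linear_count(urls + urls)
```

Both structures use a fixed amount of memory regardless of how many events pass through them, which is the trade the slide describes: a small, bounded error in exchange for space and speed.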
  12. Data Stack
       • Diagram layers: Products, Data Analysis/Query, Splitter, Tree Builder, DFS
  13. Tree Builder
       • Tree structure maps to indexes
       • Tree databases are sharded and distributed across the cluster
       • Trees can be incrementally updated as new data is received
       • Trees can be queried while we are updating them
       • A path through the tree represents a row in a table
       • Finding and iterating over sparse sets is efficient
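
A small sketch may help make "a path through the tree represents a row in a table" concrete. The nested-dictionary layout, the `add_path` and `rows` helpers, and the publisher / service / type / URL ordering are assumptions made for illustration; the slides do not describe Hydra's on-disk node format.

```python
def add_path(tree, path):
    """Walk (or create) one node per path element, bumping a hit count at each level."""
    node = tree
    for key in path:
        child = node["children"].setdefault(key, {"count": 0, "children": {}})
        child["count"] += 1
        node = child

def rows(node, prefix=()):
    """Yield every root-to-leaf path with its leaf count: one table row per path."""
    for key, child in node["children"].items():
        path = prefix + (key,)
        if child["children"]:
            yield from rows(child, path)
        else:
            yield path, child["count"]

tree = {"count": 0, "children": {}}

# Hypothetical events shaped like the labels on the later slides.
add_path(tree, ("pub1", "svc1", "typ1", "url1"))
add_path(tree, ("pub1", "svc1", "typ1", "url1"))  # incremental update of an existing path
add_path(tree, ("pub2", "svc2", "typ2", "url2"))  # new branch created on demand

table = sorted(rows(tree))
```

Because only the branches that actually occur are created, sparse combinations cost nothing, and intermediate nodes such as `tree["children"]["pub1"]` already hold the aggregate for everything beneath them.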
  14. Building Trees
       • In-Process Database (BDB)
           • Stores key/value pairs in a B-tree structure
           • Log-based, append-only storage system
       • PageDB
           • Groups keys and values together for efficient block-level compression
           • Data is stored in compressed format
       • Diagram layers: Application, PageDB, Berkeley DB JE, Operating System
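
The PageDB bullet, grouping keys and values so that compression works on whole blocks, can be sketched as follows. This is not PageDB's actual format: the page size, the JSON encoding, the use of zlib, and the `build_pages` and `lookup` helpers are all assumptions chosen to keep the example short.

```python
import bisect
import json
import zlib

def build_pages(pairs, page_size=3):
    """Sort key/value pairs, group them into pages, and compress each page as one block."""
    items = sorted(pairs.items())
    pages = []
    for start in range(0, len(items), page_size):
        chunk = items[start:start + page_size]
        first_key = chunk[0][0]
        blob = zlib.compress(json.dumps(chunk).encode("utf-8"))
        pages.append((first_key, blob))
    return pages

def lookup(pages, key):
    """Find the page whose key range covers `key`, decompress it, and scan for the key."""
    first_keys = [first for first, _ in pages]
    index = bisect.bisect_right(first_keys, key) - 1
    if index < 0:
        return None  # key sorts before the first page
    chunk = json.loads(zlib.decompress(pages[index][1]).decode("utf-8"))
    return dict(chunk).get(key)

# Hypothetical per-URL counters.
counts = {f"url{i:02d}": i * 10 for i in range(10)}
pages = build_pages(counts)
```

Neighbouring keys tend to share prefixes, so compressing them together usually does better than compressing each value alone, at the cost of decompressing a whole page per lookup.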
  15. Data Stack
       • Diagram labels, in extraction order: url1, svc1, typ1, url1, New Data, pub1, svc1, typ1, url1, svc1, url1
  16. Data Stack
       • Diagram labels, in extraction order: url1, svc1, typ1, url1, New Data, pub2, svc1, typ1, url1, svc1, url1
  17. Data Stack
       • Diagram labels, in extraction order: url1, svc1, typ2, url1, New Data, pub2, svc1, typ1, url1, svc1, url1
  18. Data Stack
       • Diagram labels, in extraction order: url1, svc1, typ2, url1, New Data, pub2, svc2, typ1, url1, svc1, url1
  19. Data Stack
       • Diagram labels, in extraction order: url1, svc1, typ2, url2, New Data, pub2, svc2, typ1, url1, svc1, url1
  20. Open Source?
       • Stream-Lib: github.com/clearspring/stream-lib
       • Hydra, Splitter, Stream Server, Flexible Logging, PageDB: not yet, but we want to
  21. Questions?
