Big Data and APIs - a recon tour on how to successfully do Big Data analytics


  1. Big Data & APIs: a recon tour on how to successfully do Big Data
  2. More events, more users. Facebook users post 4.5 billion items a day (as of Sep 2013); Facebook has 1.2 billion monthly active users (as of Sep 2013).
  3. More messages, more transactions. WhatsApp went from 0 to 31 billion messages sent daily (as of Aug 2013).
  4. Example: notify all my friends. 1 billion posts a day!

     for {
       x       <- post.stream
       user    <- getUser(x)
       message <- getData(x)
       friend  <- getFriends(user)
     } yield notifyFriend(friend, user, message.id)
  5. Pleasingly parallel problems
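The notify-all-friends fan-out is pleasingly parallel: each post can be processed independently, with no shared state and no ordering constraints. A minimal Scala sketch using plain futures (`notify` is a hypothetical stand-in for the real notification call):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in for the per-post work: each job is independent,
// so it can run on any worker, in any order.
def notify(post: Int): String = s"notified friends of post $post"

val posts = (1 to 8).toList

// Fan the independent jobs out across the default thread pool.
val work: Future[List[String]] = Future.traverse(posts)(p => Future(notify(p)))
val results = Await.result(work, 10.seconds)

assert(results.size == 8)
assert(results.head == "notified friends of post 1") // traverse preserves order
```

At a billion posts a day the same shape holds; only the executor changes, from a local thread pool to a cluster.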
  6. News filtering. This is a tougher problem: you cannot read all that stuff!
  7. News filtering: “a machine feeds you what to read”
  8. Notify only those who care: the context is much bigger now. Still 1 billion posts a day!

     for {
       x           <- post.stream
       user        <- getUser(x)
       message     <- getData(x)
       friend      <- getFriends(user)
       hustle      <- getFriendNonsense(friend)
       weather     <- getWeather(user)
       mood        <- getMood(user)
       vibe        <- getMood(friend)
       topics      <- getTrendingTopics(friends)
       market      <- getChart('gold, 'bigmac)
       interesting <- hal9000(hustle, weather, mood, vibe, topics, market)
       if interesting
     } yield notifyFriend(friend, user, message.id)
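The guarded comprehension above can be made concrete with stand-in data: the `if interesting` clause filters the fan-out so only relevant notifications survive. All names below are hypothetical stand-ins for the slide's helpers:

```scala
// Stand-in data: each post carries its friends and an "interesting" score.
case class Post(id: Int, friends: List[String], score: Double)

val posts = List(
  Post(1, List("ann", "bob"), 0.9),
  Post(2, List("ann"), 0.1),
  Post(3, List("cat", "bob"), 0.7)
)

val threshold = 0.5

// Same shape as the slide: a for-comprehension with an `if` guard.
val notifications = for {
  post   <- posts
  friend <- post.friends
  if post.score > threshold // only notify those who care
} yield (friend, post.id)

assert(notifications == List(("ann", 1), ("bob", 1), ("cat", 3), ("bob", 3)))
```

Post 2 is dropped entirely; the guard prunes work before the expensive notification step runs.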
  9. Dealing with context
  10. Machine learning to the rescue: the problem, the constraints
  11. Data science: random forests (from bigml.com). Solve a classification problem.
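A random forest classifies by majority vote over many decision trees, each trained on a random slice of the data. As an illustration only, here are three hand-written "stumps" voting on a two-feature point (the features, thresholds, and labels are made up):

```scala
type Point = (Double, Double)

// Three toy decision stumps standing in for a trained forest.
val trees: List[Point => String] = List(
  p => if (p._1 > 0.5) "spam" else "ham",
  p => if (p._2 > 0.3) "spam" else "ham",
  p => if (p._1 + p._2 > 1.0) "spam" else "ham"
)

// The forest's answer is the most common vote across the trees.
def classify(p: Point): String =
  trees.map(t => t(p)).groupBy(identity).maxBy(_._2.size)._1

assert(classify((0.9, 0.9)) == "spam") // all three trees vote spam
assert(classify((0.1, 0.1)) == "ham")
```

Real forests differ in how the trees are grown (bootstrap samples, random feature subsets), but the vote at the end is exactly this.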
  12. Millions of features. Millions of users and preferences. A very large sparse matrix!
  13. Data science: time-series prediction. Extract features; correlate time series. A very large sparse matrix!
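With millions of users and features, a dense users x features array is hopeless, which is why both slides stress the sparse view: store only the non-zero cells. A minimal coordinate-map sketch in Scala (the user and feature ids are made up):

```scala
// Sparse user x feature matrix: one map entry per non-zero cell.
// A dense layout would need users * features doubles; this needs
// only one entry per observed value.
val nnz: Map[(Int, Int), Double] = Map(
  (0, 42)    -> 1.0, // user 0, feature 42
  (0, 99123) -> 0.5,
  (7, 42)    -> 1.0
)

def at(row: Int, col: Int): Double = nnz.getOrElse((row, col), 0.0)

// Dot product of two user rows touches only the stored entries of row a.
def rowDot(a: Int, b: Int): Double =
  nnz.collect { case ((r, c), v) if r == a => c -> v }
     .map { case (c, v) => v * at(b, c) }
     .sum

assert(at(0, 42) == 1.0)
assert(at(3, 3) == 0.0)   // absent cells read as zero
assert(rowDot(0, 7) == 1.0)
```

Production systems use compressed layouts (CSR/CSC) for the same idea, but the invariant is identical: work is proportional to non-zeros, not to the full matrix.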
  14. RAM: 100 terabytes, disk: 100 petabytes, CPU: 100 teraflops
  15. Bummer.
  16. Why?
  17. Nature went that way too. Ain’t that funny? “Evolving to multicellular organisms”: more resilient (cells die, the organism lives on), and complex tasks cannot be handled by a single cell.
  18. Distributed parallel problems
  19. A few distributed computing paradigms. On one side: MPI, supercomputing, layered memory architectures, locking, ACID (homogeneous, simpler model). On the other: heterogeneous systems, the actor model, state machines.
  20. The MapReduce computing model
  21. The MapReduce computing model
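The MapReduce model can be shown end to end on toy data: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase folds each group. This word-count sketch mirrors the three phases with plain Scala collections (no Hadoop involved):

```scala
val docs = List("big data", "big apis", "data apis apis")

// Map phase: emit one (word, 1) pair per token.
val mapped: List[(String, Int)] =
  docs.flatMap(_.split(" ").map(w => (w, 1)))

// Shuffle phase: group the pairs by key.
val shuffled: Map[String, List[Int]] =
  mapped.groupBy(_._1).map { case (w, pairs) => w -> pairs.map(_._2) }

// Reduce phase: fold each group down to a single value.
val counts: Map[String, Int] =
  shuffled.map { case (w, ones) => w -> ones.sum }

assert(counts("apis") == 3)
assert(counts("big") == 2)
assert(counts("data") == 2)
```

On a cluster the same three phases run across machines: map tasks on input splits, a network shuffle keyed by word, and reduce tasks per key range.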
  22. MapReduce: how well are we doing?
  23. The CAP theorem: 12 years later. The CAP theorem is largely misunderstood.
  24. High availability. A system can be up yet not available (think of a network outage: your system is in P mode). How to improve it: replication/redundancy (3 or 5 replicas are common in highly available systems) and dynamic commission/decommission (rebalance the cluster around dead or new nodes).
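One standard way to reason about those 3- and 5-replica setups (not spelled out on the slide) is the quorum rule: if a write must reach W replicas and a read must consult R, then R + W > N guarantees every read overlaps at least one replica holding the latest write. A quick sketch:

```scala
// Quorum overlap rule for N replicas: read quorum R and write quorum W
// intersect whenever R + W > N, so a read cannot miss the newest write.
def quorumOverlaps(n: Int, r: Int, w: Int): Boolean = r + w > n

assert(quorumOverlaps(3, 2, 2))  // classic N=3, R=W=2
assert(!quorumOverlaps(3, 1, 2)) // R=1, W=2: a read may hit a stale replica
assert(quorumOverlaps(5, 3, 3))  // N=5 majority quorums
```

Lower R or W buys latency and availability at the price of possibly stale reads; this is the dial behind "tuning CAP" on the next slide.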
  25. Tuning CAP: understand your use cases
  26. A proven stack today: functional. Hadoop Distributed FS, Hadoop distributed run-time (MapReduce), Hive (DB), Python, R, Cassandra (distributed low-latency datastore), Akka (web server, in-memory runtime).
  27. A proven stack today: monitoring and logging. Base stack: Hadoop Distributed FS, Hadoop distributed run-time (MapReduce), Hive (DB), Python, R, Cassandra (distributed low-latency datastore), Akka (web server, in-memory runtime). Monitoring/logging tools: Atmos, DataStax OpsCenter, Hue, Ambari, Ganglia, Elasticsearch, Logstash, Kibana, Marvel.
  28. Everything Distributed
  29. Latency tradeoffs
  30. Hmm, that’s a complex system. How do we manage it?
  31. Hmm, that’s a complex system. How do we manage it? Lazy evaluated, scheduled.
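Lazy evaluation, one of the two management tools the slide names, defers work until a result is actually demanded, so the scheduler rather than the caller decides when an expensive step runs. A minimal Scala sketch (`expensiveModel` is a made-up stand-in for a costly pipeline stage):

```scala
var ran = List.empty[String]

// `lazy val` in Scala: the right-hand side runs on first access, once.
lazy val expensiveModel: Int = {
  ran = ran :+ "trained"
  42
}

// Wiring up the pipeline does no work yet...
assert(ran.isEmpty)

// ...the computation fires only on first use, and exactly once.
assert(expensiveModel == 42)
assert(expensiveModel == 42)
assert(ran == List("trained"))
```

The same principle scales up: a lazily built job graph lets a cluster scheduler batch, reorder, and skip work that nothing downstream demands.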
  32. APIs are everywhere.
  33. Thanks
