Big Data and APIs - a recon tour on how to successfully do Big Data analytics


Transcript

  • 1. Big Data & APIs. A recon tour on how to successfully do Big Data analytics.
  • 2. More events, more users. Facebook users post 4.5 billion items a day (as of Sep 2013); Facebook has 1.2 billion monthly active users (as of Sep 2013).
  • 3. More messages, more transactions. WhatsApp went from 0 to 31 billion messages sent daily (as of Aug 2013).
  • 4. Example: notify all my friends. 1 billion posts a day!

    for {
      x       <- post.stream
      user    <- getUser(x)
      message <- getData(x)
      friend  <- getFriends(user)
    } yield notifyFriend(friend, user, message.id)
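    To make the slide's pipeline concrete, here is a minimal runnable sketch in Scala. The Post and User types and the lookups (getUser, getData, getFriends, notifyFriend) are hypothetical stubs standing in for the real feed, social-graph and messaging services.

      // Minimal runnable sketch of the slide's pipeline; all types and lookups
      // below are made-up stubs for the real feed / graph / messaging services.
      object NotifyAllFriends {
        final case class Post(id: Long, userId: Long, body: String)
        final case class User(id: Long, name: String)

        val stream: List[Post] = List(Post(1, 42, "hello"), Post(2, 42, "world"))

        def getUser(p: Post): Option[User]  = Some(User(p.userId, s"user-${p.userId}"))
        def getData(p: Post): Option[Post]  = Some(p)
        def getFriends(u: User): List[User] = List(User(7, "alice"), User(8, "bob"))
        def notifyFriend(f: User, u: User, postId: Long): String =
          s"notify ${f.name}: ${u.name} posted $postId"

        def main(args: Array[String]): Unit = {
          val notifications =
            for {
              x       <- stream
              user    <- getUser(x)
              message <- getData(x)
              friend  <- getFriends(user)
            } yield notifyFriend(friend, user, message.id)
          notifications.foreach(println)   // one notification per (post, friend) pair
        }
      }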
  • 5. Pleasingly parallel problems
  • 6. News filtering. This is a tougher problem: you cannot read all that stuff!
  • 7. News filtering: “a machine feeds you what to read”
  • 8. 1 billion posts a day, but notify only those who care: the context is much bigger now.

    for {
      x           <- post.stream
      user        <- getUser(x)
      message     <- getData(x)
      friend      <- getFriends(user)
      hustle      <- getFriendNonsense(friend)
      weather     <- getWeather(user)
      mood        <- getMood(user)
      vibe        <- getMood(friend)
      topics      <- getTrendingTopics(friend)
      market      <- getChart('gold, 'bigmac)
      interesting <- hal9000(hustle, weather, mood, vibe, topics, market)
      if interesting
    } yield notifyFriend(friend, user, message.id)
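    The new ingredient over slide 4 is the guard: only bindings for which the predicate holds flow on to the yield. A tiny sketch of that filtering, where hal9000 is a made-up stand-in for whatever relevance model scores a post:

      // The `if` guard in a for-comprehension is ordinary filtering.
      // hal9000 here is a made-up relevance predicate for illustration.
      def hal9000(topicOverlap: Int, friendIsClose: Boolean): Boolean =
        friendIsClose || topicOverlap > 2

      val candidates = List((1, true), (4, false), (0, false))

      val notified =
        for {
          (overlap, close) <- candidates
          if hal9000(overlap, close)        // same role as `if interesting` on the slide
        } yield (overlap, close)
      // notified == List((1,true), (4,false)): the third candidate is dropped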
  • 9. Dealing with context
  • 10. Machine learning to the rescue: the problem, the constraints
  • 11. Data science: random forests (from bigml.com). Solve a classification problem.
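    A random forest is an ensemble of decision trees that vote on the class label. A toy sketch of that idea, with single-feature "stumps" standing in for full trees; the stumps, thresholds and sample point are made up, and a real model would come from a library or a service such as the bigml.com one shown on the slide:

      // Toy ensemble classifier: majority vote over made-up decision stumps.
      object ForestSketch {
        type Features = Vector[Double]

        // A stump splits on one feature at a threshold and predicts a class (0 or 1).
        final case class Stump(feature: Int, threshold: Double, below: Int, above: Int) {
          def predict(x: Features): Int = if (x(feature) <= threshold) below else above
        }

        // The "forest" prediction is the majority vote of its trees.
        def predict(forest: Seq[Stump], x: Features): Int =
          forest.map(_.predict(x)).groupBy(identity).maxBy(_._2.size)._1

        def main(args: Array[String]): Unit = {
          val forest = Seq(Stump(0, 0.5, 0, 1), Stump(1, 2.0, 1, 0), Stump(0, 1.5, 0, 1))
          println(predict(forest, Vector(0.7, 1.0)))   // votes 1, 1, 0 -> prints 1
        }
      }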
  • 12. Millions of features. Millions of users and preferences. A very large sparse matrix!
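    With millions of features and users, almost every entry is zero, so only the non-zero entries are stored. A minimal sketch of that representation; the user ids, feature indices and values are made-up sample data:

      // Sparse user x feature matrix: keep only non-zero entries, one map per user.
      object SparseSketch {
        type SparseRow    = Map[Int, Double]       // featureIndex -> value
        type SparseMatrix = Map[Long, SparseRow]   // userId -> row

        val prefs: SparseMatrix = Map(
          42L -> Map(3 -> 1.0, 900001 -> 0.5),     // user 42 touched 2 of millions of features
          7L  -> Map(3 -> 2.0)
        )

        // Dot product of two sparse rows: iterate over the smaller one only.
        def dot(a: SparseRow, b: SparseRow): Double = {
          val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
          small.iterator.map { case (k, v) => v * large.getOrElse(k, 0.0) }.sum
        }

        def main(args: Array[String]): Unit =
          println(dot(prefs(42L), prefs(7L)))      // 1.0 * 2.0 = 2.0, shared feature 3 only
      }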
  • 13. Data science: time series prediction. Extract features, correlate time series. Again a very large sparse matrix!
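    "Correlate time series" in its simplest form is Pearson correlation over time-aligned samples. A small sketch; the two series are made up (echoing the 'gold / 'bigmac pair from slide 8):

      // Pearson correlation between two equally long, time-aligned series.
      object CorrelationSketch {
        def pearson(x: Seq[Double], y: Seq[Double]): Double = {
          require(x.length == y.length && x.nonEmpty)
          val n  = x.length.toDouble
          val mx = x.sum / n
          val my = y.sum / n
          val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
          val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
          val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
          cov / (sx * sy)
        }

        def main(args: Array[String]): Unit = {
          val gold   = Seq(1.0, 2.0, 3.0, 4.0)     // made-up sample series
          val bigmac = Seq(2.0, 4.1, 5.9, 8.2)
          println(pearson(gold, bigmac))           // close to 1.0: strongly correlated
        }
      }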
  • 14. RAM: 100 terabytes, disk: 100 petabytes, CPU: 100 teraflops
  • 15. Bummer.
  • 16. Why?
  • 17. Nature went that way too. Ain’t that funny? “Evolving to multicellular organisms”: more resilient, since cells die but the organism lives on; complex tasks cannot be handled by a single cell.
  • 18. Distributed parallel problems
  • 19. A few distributed computing paradigms. On one side: MPI, supercomputing, layered memory architecture, locking, ACID, homogeneous, a simpler model. On the other: heterogeneous, actor model, state machine.
  • 20. The MapReduce computing model
  • 21. The MapReduce computing model
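    The shape of a MapReduce job, sketched with plain Scala collections standing in for the distributed phases; in Hadoop the map and reduce steps run on many nodes, and groupBy below plays the role of the shuffle that routes each key to one reducer. The input lines are made up:

      // Word count, the canonical MapReduce example, on local collections.
      object WordCountSketch {
        def main(args: Array[String]): Unit = {
          val lines = List("big data and apis", "big data analytics")   // made-up input

          val counts =
            lines
              .flatMap(_.split("\\s+"))       // map phase: split into words...
              .map(word => (word, 1))         // ...and emit (word, 1) pairs
              .groupBy(_._1)                  // shuffle: group pairs by key
              .map { case (word, pairs) =>    // reduce phase: sum the counts per word
                (word, pairs.map(_._2).sum)
              }

          counts.toSeq.sortBy(-_._2).foreach(println)   // (big,2), (data,2), (and,1), ...
        }
      }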
  • 22. MapReduce: how well are we doing?
  • 23. CAP theorem, 12 years later. The CAP theorem is largely misunderstood.
  • 24. High availability. A system can be up but not available (think of a network outage: your system is in P mode). How to improve it: replication and redundancy (3 or 5 replicas are common in highly available systems), and dynamic commission/decommission (re-balance the cluster around dead or new nodes).
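    One common mechanism behind replication and dynamic commission/decommission is a consistent-hash ring: each key lives on the next few distinct nodes clockwise from its hash, so adding or removing a node only moves the keys that node owned. A minimal sketch; node names, the hash choice, the 16 virtual points per node and replicas = 3 are illustrative assumptions:

      // Minimal consistent-hash ring for replica placement and re-balancing.
      object RingSketch {
        private def h(s: String): Int = scala.util.hashing.MurmurHash3.stringHash(s)

        final case class Ring(nodes: Set[String], replicas: Int = 3) {
          // Each node owns several virtual points, sorted around the ring.
          private val points: Vector[(Int, String)] =
            nodes.toVector
              .flatMap(n => (1 to 16).map(i => h(s"$n#$i") -> n))
              .sortBy(_._1)

          // A key is stored on the next `replicas` distinct nodes clockwise from its hash.
          def replicasFor(key: String): List[String] = {
            val clockwise = points.dropWhile(_._1 < h(key)) ++ points   // wrap around
            clockwise.map(_._2).distinct.take(replicas).toList
          }
        }

        def main(args: Array[String]): Unit = {
          val ring = Ring(Set("n1", "n2", "n3", "n4"))
          println(ring.replicasFor("user:42"))                      // 3 nodes hold this key
          // Decommissioning n2 only moves the keys that lived on n2:
          println(Ring(Set("n1", "n3", "n4")).replicasFor("user:42"))
        }
      }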
  • 25. Tuning CAP: understand your use cases
  • 26. A proven stack today, the functional core: Hadoop Distributed FS, Hadoop distributed run-time (MapReduce), Hive (DB), Python, R, Cassandra (distributed low-latency datastore), Akka (web server, in-memory runtime).
  • 27. A proven stack today, monitoring and logging on top of that same core: Atmos, DataStax OpsCenter, Hue, Ambari, Ganglia, Elasticsearch, Logstash, Kibana, Marvel.
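    As a taste of the Akka piece of that stack, a tiny actor keeping in-memory state and reacting to messages. The Counter actor and its messages are made up for illustration, and the snippet assumes the akka-actor (classic actors) library is on the classpath:

      import akka.actor.{Actor, ActorSystem, Props}

      // Made-up example: an actor that counts page hits in memory.
      object AkkaSketch {
        final case class Hit(page: String)
        case object Report

        class Counter extends Actor {
          private var hits = Map.empty[String, Long]
          def receive: Receive = {
            case Hit(page) => hits += page -> (hits.getOrElse(page, 0L) + 1L)
            case Report    => println(hits)   // e.g. Map(/home -> 2, /about -> 1)
          }
        }

        def main(args: Array[String]): Unit = {
          val system  = ActorSystem("sketch")
          val counter = system.actorOf(Props[Counter], "counter")
          counter ! Hit("/home"); counter ! Hit("/home"); counter ! Hit("/about")
          counter ! Report
          system.terminate()
        }
      }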
  • 28. Everything Distributed
  • 29. Latency tradeoffs
  • 30. Hmm, that’s a complex system. How do we manage it?
  • 31. Hmm, that’s a complex system. How do we manage it? Lazy evaluation and scheduling.
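    "Lazy evaluation and scheduling" in Scala terms: describe the work first, and run only as much of it as is actually demanded. A small sketch; the expensive step is a made-up stand-in for a real pipeline stage:

      // Lazily evaluated pipeline: nothing runs until a result is forced.
      object LazySketch {
        def main(args: Array[String]): Unit = {
          def expensive(x: Int): Int = { println(s"computing $x"); x * x }

          // This only records what to do; no element is computed yet.
          val plan = (1 to 1000000).iterator.map(expensive).filter(_ % 2 == 0)

          // Forcing the first result runs only as much of the plan as needed.
          println(plan.take(1).toList)   // prints "computing 1", "computing 2", then List(4)
        }
      }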
  • 32. APIs are everywhere.
  • 33. Thanks