Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Published in: Technology

  1. Big Data & APIs. A recon tour on how to successfully do Big Data
  2. More events, users. Facebook users post 4.5 billion items a day (as of Sep 2013). Facebook MAU: 1.2 billion (as of Sep 2013).
  3. More messages, transactions. WhatsApp: from 0 to 31 billion messages sent daily (as of Aug 2013).
  4. Example: Notify all my friends. 1 billion posts a day!
       // one notification per (post, friend) pair
       for {
         x       <- post.stream
         user    <- getUser(x)
         message <- getData(x)
         friend  <- getFriends(user)
       } yield notifyFriend(friend, user, message.id)
  5. Pleasingly parallel problems
  6. News filtering. This is a tougher problem: you cannot read all that stuff!
  7. News filtering: “a machine feeds you what to read”
  8. Notify only those who care. 1 billion posts a day! The context is much bigger now.
       // same loop as slide 4, plus context lookups and a final guard
       for {
         x           <- post.stream
         user        <- getUser(x)
         message     <- getData(x)
         friend      <- getFriends(user)
         hustle      <- getFriendNonsense(friend)
         weather     <- getWeather(user)
         mood        <- getMood(user)
         vibe        <- getMood(friend)
         topics      <- getTrendingTopics(friends)
         market      <- getChart('gold, 'bigmac)
         interesting <- hal9000(hustle, weather, mood, vibe, topics, market)
         if interesting
       } yield notifyFriend(friend, user, message.id)
  9. Dealing with context
  10. Machine learning to the rescue. The problem. Constraints.
  11. Data science: random forests (from bigml.com). Solve a classification problem.
  12. Millions of features. Millions of users and preferences. Very large sparse matrix!
  13. Data science: time series prediction. Extract features. Correlate time series. Very large sparse matrix!
  14. RAM: 100 terabytes, disk: 100 petabytes, CPU: 100 teraflops
  15. Bummer.
  16. Why?
  17. Nature went that way too. Ain’t that funny? “Evolving to multicellular organisms”. More resilient: cells die, the organism lives on. Complex tasks: cannot be handled by a single cell.
  18. Distributed parallel problems
  19. A few distributed computing paradigms. MPI, supercomputing, layered memory arch., locking, ACID. Homogeneous, simpler model. Heterogeneous, actor model, state-machine.
  20. The Map Reduce computing
  21. The Map Reduce computing
  22. Map-Reduce: How well are we doing?
  23. CAP theorem: 12 years later. The CAP theorem is largely misunderstood.
  24. High Availability. A system can be up, but not available (think of a network outage: your system is in P mode). How to improve it. Replication / Redundancy: 3 or 5 replicas are common in highly available systems. Dynamic Commission - Decommission: re-balance the cluster for dead/new nodes.
  25. Tuning CAP: understand your use cases
  26. A proven stack today: Functional. Hadoop Distributed FS; Hadoop Distributed Run-Time (Map-Reduce); Hive (DB); Python; R; Cassandra (distributed low-latency datastore); Akka (web server, in-memory runtime).
  27. A proven stack today: Monitoring-Logging. Hadoop Distributed FS; Hadoop Distributed Run-Time (Map-Reduce); Hive (DB); Python; R; Cassandra (distributed low-latency datastore); Akka (web server, in-memory runtime). Atmos, DataStax OpsCenter, Hue, Ambari, Ganglia, Elastic Search, Logstash, Kibana, Marvel.
  28. Everything Distributed
  29. Latency tradeoffs
  30. Hmm, that's a complex system. How to manage?
  31. Hmm, that's a complex system. How to manage? Lazy evaluated, scheduled.
  32. APIs are everywhere.
  33. Thanks
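
The notification loops on slides 4 and 8 can be sketched as runnable Scala. This is a minimal illustration, not the deck's actual system: the case classes, the toy `friendsOf` graph, and the `interesting` predicate (a stand-in for `hal9000`) are all hypothetical, and an in-memory `Seq` replaces `post.stream`.

```scala
// Hypothetical data model; none of these types appear in the slides.
final case class User(name: String)
final case class Message(id: Long, text: String)
final case class Post(author: User, message: Message)
final case class Notification(friend: User, author: User, messageId: Long)

object NotifyFriends {
  // Assumed toy social graph; a real system would query a datastore.
  val friendsOf: Map[String, List[User]] = Map(
    "ann" -> List(User("bob"), User("cat")),
    "bob" -> List(User("ann"))
  )

  def getUser(p: Post): User = p.author
  def getData(p: Post): Message = p.message
  def getFriends(u: User): List[User] = friendsOf.getOrElse(u.name, Nil)
  def notifyFriend(friend: User, author: User, messageId: Long): Notification =
    Notification(friend, author, messageId)

  // Slide 4: one notification per (post, friend) pair.
  def notifyAll(posts: Seq[Post]): Seq[Notification] =
    for {
      x <- posts
      user = getUser(x)
      message = getData(x)
      friend <- getFriends(user)
    } yield notifyFriend(friend, user, message.id)

  // Toy relevance test standing in for hal9000: does the text mention the friend?
  def interesting(friend: User, message: Message): Boolean =
    message.text.toLowerCase.contains(friend.name)

  // Slide 8: the same loop with a guard, so only friends who care are notified.
  def notifyInterested(posts: Seq[Post]): Seq[Notification] =
    for {
      x <- posts
      user = getUser(x)
      message = getData(x)
      friend <- getFriends(user)
      if interesting(friend, message)
    } yield notifyFriend(friend, user, message.id)
}
```

Each post is processed independently of the others, which is why the unfiltered version counts as pleasingly parallel: the input can be split across workers without coordination. The guard in the second version is cheap here only because the predicate is a toy; the slides' point is that the real context lookups are not.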

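The "very large sparse matrix" on slides 12 and 13 can be made concrete with a rough sketch. It assumes one million features by one million users and 8-byte cells (both assumptions chosen for illustration; the slides only say "millions"), and it stores only non-zero entries in a map keyed by (row, column).

```scala
object SparseSketch {
  // Bytes needed to store every cell densely; 8 bytes per cell is an assumption.
  def denseBytes(rows: Long, cols: Long, bytesPerCell: Long = 8L): Long =
    rows * cols * bytesPerCell

  // Sparse matrix: only non-zero cells are stored, keyed by (row, column).
  type Sparse = Map[(Int, Int), Double]

  // Missing cells read as zero.
  def get(m: Sparse, row: Int, col: Int): Double =
    m.getOrElse((row, col), 0.0)

  // Dot product of rows a and b, visiting only the stored entries of row a.
  // Scanning the whole map is fine for a toy; a real store would index by row.
  def rowDot(m: Sparse, a: Int, b: Int): Double =
    m.iterator.collect { case ((r, c), v) if r == a => v * get(m, b, c) }.sum
}
```

Under these assumptions `SparseSketch.denseBytes(1000000L, 1000000L)` comes to 8 × 10^12 bytes, about 8 terabytes for a single dense matrix, which is the scale at which the single-machine figures on slide 14 start to look necessary.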