Possible Visions for Mahout 1.0

These are the slides that we used to ignite the conversation with the audience at Hadoop Summit EU. Come over to the Mahout dev list to be part of the ongoing conversation.

  • I just have 5 minutes for this talk. Given the short time I thought I’d share with you some of the more interesting things you can do with Hadoop in 5 minutes or less…
  • In 1 minute you can perform 4.73 million concurrent authentications in the largest biometric database in the world. In India, there is no social security card. It’s difficult for the average citizen to set up a bank account, access benefit programs, and enjoy economic mobility. It’s difficult for the government as well, with over $1B of government aid classified as leakage, the result of fraud and corruption. The Aadhaar program is poised to change all that by leveraging the unique IDs that all people are born with. The program aims to get fingerprints and retina scans for all 1.2 billion citizens. The scale of this project required an integrated in-Hadoop database capable of 200-millisecond response times while supporting millions of concurrent look-ups.

Transcript

  • 1. © 2014 MapR Technologies 1
  • 4. A typical encounter with a potential Mahout user
  • 5. Which leads us to the Mahout 1.0 vision
  • 9. Example: Cooccurrence Analysis
  • 10. How often do items co-occur?
        // load distributed matrix
        val A = drmFromHDFS(...)
        // compute co-occurrences
        val C = A.t %*% A
  • 11. How often do items co-occur?
        // load distributed matrix
        val A = drmFromHDFS(...)
        // compute co-occurrences
        val C = A.t %*% A
        Under the covers:
        – Optimizer rewrites the matrix multiplication and transpose operations to a TransposeSelf operator
        – Optimizer chooses from two physical operators for TransposeSelf
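To make the idea behind slides 10 and 11 concrete, here is a minimal sketch in plain Scala, with no Mahout or Spark dependency. It assumes a toy user-by-item matrix held in local arrays; the names `Cooccurrence`, `transposeThenMultiply`, and `sumOfOuterProducts` are hypothetical and are not part of any Mahout API. It illustrates why fusing the transpose and the multiplication can pay off: A.t %*% A equals the sum of the outer products of the rows of A, so it can be computed in one pass without materializing the transpose. It is not a description of how Mahout's physical operators actually work.

```scala
object Cooccurrence {
  // Toy input (assumption): rows are users, columns are items, 1 = interaction.
  type Matrix = Array[Array[Int]]

  // Naive plan: materialize the transpose, then multiply.
  // Entry (i, j) is the dot product of item column i and item column j.
  def transposeThenMultiply(a: Matrix): Matrix = {
    val at = a.transpose
    at.map(ci => at.map(cj => ci.zip(cj).map { case (x, y) => x * y }.sum))
  }

  // Fused plan: A^T A is the sum over rows of (row outer-product row),
  // so a single pass over the rows of A is enough.
  def sumOfOuterProducts(a: Matrix): Matrix = {
    val n = a.head.length
    val c = Array.ofDim[Int](n, n)
    for (row <- a; i <- 0 until n if row(i) != 0; j <- 0 until n)
      c(i)(j) += row(i) * row(j)
    c
  }

  def main(args: Array[String]): Unit = {
    val a: Matrix = Array(Array(1, 1, 0), Array(0, 1, 1), Array(1, 1, 1))
    // Prints the rows 2,2,1 / 2,3,2 / 1,2,2
    sumOfOuterProducts(a).foreach(r => println(r.mkString(",")))
  }
}
```

Both functions return the same matrix; the second one only ever reads A row by row, which is the access pattern a row-partitioned distributed matrix offers naturally.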
  • 12. Which items co-occur anomalously?
        // compute & broadcast number
        // of interactions per item
        val numInteractions = drmBroadcast(A.colSums)
        // create indicator matrix
        val I = C.mapBlock() {
          case (keys, block) =>
            // allocate sparse block of indicator matrix
            val indicatorBlock = sparse(block.nrow, block.ncol)
            // compute indicators with loglikelihood ratio test
            for (row <- block)
              indicatorBlock(row.index,::) = computeLLR(row,numInteractions)
            keys -> indicatorBlock
        }
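Slide 12 calls a `computeLLR` helper that the slides do not define. The following is a minimal sketch of the log-likelihood ratio test on a 2x2 contingency table, written in plain Scala under stated assumptions: the object name `Llr` and the helper `forItemPair` are hypothetical, and the mapping from co-occurrence counts to table cells is one plausible reading of the slide, not the code behind it.

```scala
object Llr {
  // x * ln(x), with the usual convention that 0 * ln(0) = 0.
  private def xLogX(x: Long): Double =
    if (x == 0) 0.0 else x * math.log(x.toDouble)

  // Unnormalized Shannon entropy of a set of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // Log-likelihood ratio (G^2 statistic) for the table
  //   k11 k12
  //   k21 k22
  // Clamped at 0 to absorb floating-point round-off.
  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }

  // Hypothetical helper (assumption): build the table for items i and j from
  // their co-occurrence count, per-item interaction counts, and the number of users.
  def forItemPair(cooc: Long, countI: Long, countJ: Long, total: Long): Double =
    logLikelihoodRatio(cooc, countI - cooc, countJ - cooc, total - countI - countJ + cooc)
}
```

A score near zero means the two items co-occur about as often as chance would predict; a large score marks the pair as anomalous, and those are the entries an indicator matrix would keep.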
  • 13. Runtime
        • prototype on Apache Spark
          – fast and expressive cluster computing system
          – general computation graphs, in-memory primitives, rich API, interactive shell
        • future: add Stratosphere
          – project proposed to Apache Incubator recently
          – similar to Apache Spark, adds data flow optimization and efficient out-of-core execution
  • 16. How Does This Apply?
  • 17. How Can I Start?
  • 18. Q&A
        @ted_dunning, @mapr, maprtech, tdunning@mapr.com
        Engage with us! MapR, maprtech, mapr-technologies