Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Dan Şerban
Prezentare hadoop-100816052218-phpapp01

Published in: Technology, Education

  1. Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster
     Dan Şerban
  2. Agenda
     :: Introduction
        - Real-world uses of MapReduce
        - The origins of Hadoop
        - Hadoop facts and architecture
     :: Part 1
        - Deploying Hadoop
     :: Part 2
        - MapReduce is machine learning
     :: Q&A
  3. Why shopping cart analysis is useful to amazon.com
  4. LinkedIn and Google Reader
  5. The origins of Hadoop
     :: Hadoop got its start in Nutch
     :: A few enthusiastic developers were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers
     :: Once Google published their GoogleFS and MapReduce whitepapers, the way forward became clear
     :: Google had devised systems to solve precisely the problems the Nutch project was facing
     :: Thus, Hadoop was born
  6. Hadoop facts
     :: Hadoop is a distributed computing platform for processing extremely large amounts of data
     :: Hadoop is divided into two main components:
        - the MapReduce runtime
        - the Hadoop Distributed File System (HDFS)
     :: The MapReduce runtime allows the user to submit MapReduce jobs
     :: The HDFS is a distributed file system that provides a logical interface for persistent and redundant storage of large data
     :: Hadoop also provides the HadoopStreaming library that leverages STDIN and STDOUT so you can write mappers and reducers in your programming language of choice
  7. Hadoop facts
     :: Hadoop is based on the principle of moving computation to where the data is
     :: Data stored on the Hadoop Distributed File System is broken up into chunks and replicated across the cluster, providing fault-tolerant parallel processing and redundancy for both the data and the jobs
     :: Computation takes the form of a job, which consists of a map phase and a reduce phase
     :: Data is initially processed by map functions, which run in parallel across the cluster
     :: Map output is in the form of key-value pairs
     :: The reduce phase then aggregates the map results
     :: The reduce phase typically happens in multiple consecutive waves until the job is complete
  8. Hadoop architecture
  9. Part 1: Configuring and deploying the Hadoop cluster
  10. Hands-on with Hadoop
  11. core-site.xml - before
  12. core-site.xml - after
  13. hdfs-site.xml - before
  14. hdfs-site.xml - after
  15. mapred-site.xml - before
  16. mapred-site.xml - after
  17. Setting up SSH
     :: hadoop@hadoop-master needs to be able to ssh* into:
        - hadoop@hadoop-master
        - hadoop@chunkserver-a
        - hadoop@chunkserver-b
        - hadoop@chunkserver-c
     :: hadoop@job-tracker needs to be able to ssh* into:
        - hadoop@job-tracker
        - hadoop@chunkserver-a
        - hadoop@chunkserver-b
        - hadoop@chunkserver-c
     * Without being prompted for a password or a key passphrase
  18. Hands-on with Hadoop
  19. Hands-on with Hadoop
  20. Hands-on with Hadoop
  21. Hands-on with Hadoop
  22. Hands-on with Hadoop
  23. Hands-on with Hadoop
  24. Hands-on with Hadoop
  25. Part 2: MapReduce is machine learning
  26. Rolling your own self-hosted alternative to ...
  27. Hands-on with MapReduce
  28. Hands-on with MapReduce
  29. Hands-on with MapReduce
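  30. mapper.py
     #!/usr/bin/python
     import sys

     # Each input line is one shopping cart: whitespace-separated item IDs.
     for line in sys.stdin:
         line = line.strip()
         IDs = line.split()
         # Emit every unordered pair of items once (string comparison on IDs).
         for firstID in IDs:
             for secondID in IDs:
                 if secondID > firstID:
                     # Key and value are tab-separated, as Hadoop Streaming expects.
                     print('%s_%s\t%s' % (firstID, secondID, 1))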
  31. reducer.py
     #!/usr/bin/python
     import sys

     # Running total per item pair, keyed by "firstID_secondID".
     subTotals = {}
     for line in sys.stdin:
         line = line.strip()
         # Mapper output is "key<TAB>count".
         word = line.split('\t')[0]
         count = int(line.split('\t')[1])
         subTotals[word] = subTotals.get(word, 0) + count
     for k, v in subTotals.items():
         print("%s\t%s" % (k, v))
  32. Hands-on with MapReduce
  33. Hands-on with MapReduce
  34. Hands-on with MapReduce
  35. Hands-on with MapReduce
  36. Hands-on with MapReduce
  37. Other MapReduce use cases
     :: Google Suggest
     :: Video recommendations (YouTube)
     :: ClickStream Analysis (large web properties)
     :: Spam filtering and contextual advertising (Yahoo)
     :: Fraud detection (eBay, CC companies)
     :: Firewall log analysis to discover exfiltration and other undesirable (possibly malware-related) activity
     :: Finding patterns in social data, analyzing “likes” and building a search engine on top of them (Facebook)
     :: Discovering microblogging trends and opinion leaders, analyzing who follows whom (Twitter)
     :: Plain old supermarket shopping basket analysis
     :: The semantic web
  38. Questions / Feedback
  39. Bonus slide: Making of SQLite DB
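
The map, shuffle and reduce steps described in slide 7 and coded in mapper.py and reducer.py (slides 30 and 31) can be tried without a cluster. The following is only a rough local sketch in Python 3, not the deck's actual job setup: the function names `map_cart`, `reduce_pairs` and `run_local`, and the sample carts, are made up for illustration, and the in-memory sort merely stands in for the sorting that Hadoop performs between the two phases.

```python
from itertools import groupby


def map_cart(line):
    """Map step: yield (pair_key, 1) for each unordered pair of item IDs in a cart."""
    ids = line.split()
    for first in ids:
        for second in ids:
            # Same string comparison as mapper.py, so each pair is emitted once.
            if second > first:
                yield "%s_%s" % (first, second), 1


def reduce_pairs(sorted_pairs):
    """Reduce step: sum the counts per key. Input must already be sorted by key."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_local(carts):
    """Run map, a stand-in shuffle (plain sort), then reduce, all in memory."""
    mapped = [kv for cart in carts for kv in map_cart(cart)]
    shuffled = sorted(mapped)
    return dict(reduce_pairs(shuffled))


# Hypothetical sample data: one cart per line, whitespace-separated item IDs.
SAMPLE_CARTS = ["101 102 103", "101 103", "102 103 104"]

if __name__ == "__main__":
    for pair, total in sorted(run_local(SAMPLE_CARTS).items()):
        print("%s\t%d" % (pair, total))
```

With these three sample carts, the pairs 101_103 and 102_103 each get a total of 2, because each appears in two carts, while every other pair gets 1. Pairs with the highest totals are the "frequently bought together" candidates.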