RuG Guest Lecture

Presented at a guest lecture at the Rijksuniversiteit Groningen as part of the web and cloud computing master course.

I presented an architecture for, and a working implementation of, Hadoop-based typeahead-style search suggestions. There is a companion GitHub repo with the code and config at: https://github.com/friso/rug (there's no documentation, though).

1. End-to-end Hadoop for search suggestions. Guest lecture at RuG.
2. WEB AND CLOUD COMPUTING
3. http://dilbert.com/strips/comic/2011-01-07/
4. CLOUD COMPUTING FOR RENT (image: flickr user jurvetson)
5. 30+ million users in two years. With 3 engineers!
6. Also,
7. A quick survey...
8. This talk:
   - Build working search suggestions (like on bol.com)
   - That update automatically based on user behavior
   - Using a lot of open source (elasticsearch, Hadoop, Redis, Bootstrap, Flume, node.js, jquery)
   - With a code example, not just slides
   - And a live demo of the working thing in action
9. http://twitter.github.com/bootstrap/
10. How do we populate this array?
11. [Architecture diagram] The user sends an HTTP request (/search?q=trisodium+phosphate) to one of possibly many web servers, which write each request to a log file (2012-07-01T06:00:02.500Z /search?q=trisodium%20phosphate). The logs, possibly terabytes of them, are transported to a compute cluster that transforms them into search suggestions. The suggestions are pushed to a DB server that serves them for web requests; this thing must be really, really fast.
12. Search is important! 98% of landing page visitors go directly to the search box.
13. http://www.elasticsearch.org/
14. http://nodejs.org/
15. Node.js
   - Server-side JavaScript
   - Built on top of the V8 JavaScript engine
   - Completely asynchronous / non-blocking / event-driven
16. Non-blocking means function calls never block.
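
Purely as an illustration of what a non-blocking call looks like, here is a Java sketch using the JDK's asynchronous HTTP client (Java 11+; the URL is a placeholder). The call returns immediately and the callback fires when the response arrives, just as a node.js callback would:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class NonBlockingSketch {
        public static void main(String[] args) throws InterruptedException {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request =
                    HttpRequest.newBuilder(URI.create("http://example.org/")).build();

            // sendAsync returns immediately; the lambda runs later,
            // once the response has arrived.
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                  .thenAccept(response -> System.out.println(response.statusCode()));

            System.out.println("request sent, not blocked waiting for it");
            Thread.sleep(2000); // keep the JVM alive for the demo
        }
    }
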
17. Distributed filesystem
   - Consists of a single master node and multiple (many) data nodes
   - Files are split up into blocks (64MB by default)
   - Blocks are spread across data nodes in the cluster
   - Each block is replicated multiple times to different data nodes in the cluster (typically 3 times)
   - The master node keeps track of which blocks belong to a file
18. Accessible through:
   - Java API
   - FUSE (filesystem in user space) driver, available to mount as a regular FS
   - C API
   - Basic command line tools in the Hadoop distribution
   - Web interface
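
A minimal sketch of the Java API route, reading a file from HDFS; the path is a made-up example, not from the talk:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS and friends from the Hadoop
            // configuration on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical log file; the metadata lookup goes to the
            // NameNode, the actual bytes come from DataNodes.
            Path path = new Path("/logs/searches/part-00000");
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(path)))) {
                System.out.println(reader.readLine());
            }
        }
    }
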
19. File creation, directory listing and other metadata actions go through the master node (e.g. ls, du, fsck, create file). Data goes directly to and from data nodes (read, write, append). Local read path optimization: clients located on the same machine as a data node will always access the local replica when possible.
20. [Diagram: HDFS read and write paths. An HDFS client performs metadata operations (create file, look up /some/file and /foo/bar) against the Name Node, then writes data to and reads data from the Data Nodes directly, each of which stores blocks on multiple local disks. Data Nodes replicate blocks among themselves; a node-local HDFS client reads from the local replica.]
21. NameNode daemon
   - Filesystem master node
   - Keeps track of directories, files and block locations
   - Assigns blocks to data nodes
   - Keeps track of live nodes (through heartbeats)
   - Initiates re-replication in case of data node loss
   - Block metadata is held in memory; it will run out of memory when too many files exist
22. DataNode daemon
   - Filesystem worker node / “block server”
   - Uses an underlying regular FS for storage (e.g. ext4)
   - Takes care of distribution of blocks across disks (don’t use RAID; more disks means more IO throughput)
   - Sends heartbeats to the NameNode
   - Reports its blocks to the NameNode (on startup)
   - Does not know about the rest of the cluster (shared nothing)
23. HDFS trivia
   - HDFS is write once, read many, but has append support in newer versions
   - Has built-in compression at the block level
   - Does end-to-end checksumming on all data (on read and periodically)
   - Best used for large, unstructured volumes of raw data in BIG files used for batch operations
   - Optimized for sequential reads, not random access
24. Job input comes from files on HDFS
   - Other formats are possible, but require a specialized InputFormat implementation
   - Built-in support for text files (convenient for logs, csv, etc.)
   - Files must be splittable for parallelization to work
   - Not all compression formats have this property (e.g. gzip)
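
A hypothetical job-setup sketch showing the built-in text input format; the paths and job name are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "search-term-counts");
            job.setJarByClass(JobSetup.class);

            // TextInputFormat is the built-in format for line-oriented
            // text such as logs: each line becomes one map() call, and
            // splittable files are divided across many mappers.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/logs/searches"));
            FileOutputFormat.setOutputPath(job, new Path("/tmp/term-counts"));

            // Mapper and reducer classes would be set here.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
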
25. JobTracker daemon
   - MapReduce master node
   - Takes care of scheduling and job submission
   - Splits jobs into tasks (Mappers and Reducers)
   - Assigns tasks to worker nodes
   - Reassigns tasks in case of failure
   - Keeps track of job progress
   - Keeps track of worker nodes through heartbeats
26. TaskTracker daemon
   - MapReduce worker process
   - Starts Mappers and Reducers assigned by the JobTracker
   - Sends heartbeats to the JobTracker
   - Sends task progress to the JobTracker
   - Does not know about the rest of the cluster (shared nothing)
27. [Diagram: cluster layout. One master machine runs the NameNode and JobTracker daemons; each of four worker machines runs both a DataNode and a TaskTracker.]
28. “Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” http://flume.apache.org/
29. (image from http://flume.apache.org)
30. (image from http://flume.apache.org)
31. (image from http://flume.apache.org)
32. Processing the logs, a simple approach:
   - Divide time into 15 minute buckets
   - Count # times a search term appears in each bucket
   - Divide counts by (current time - time of start of bucket)
   - Rank terms according to the sum of divided counts
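
To make the weighting concrete, a sketch with made-up numbers: a term searched 40 times in each of the last three 15-minute buckets. Newer buckets have a smaller age, so they dominate the sum:

    public class ScoreSketch {
        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            long fifteenMinutes = 1000L * 60 * 15;

            // (bucket start, count) pairs; all numbers are made up.
            long[][] buckets = {
                { now - 1 * fifteenMinutes, 40 },
                { now - 2 * fifteenMinutes, 40 },
                { now - 3 * fifteenMinutes, 40 },
            };

            // Each count is divided by the age of its bucket, so more
            // recent buckets contribute more; the term is ranked by
            // the sum.
            double score = 0;
            for (long[] b : buckets) {
                score += (double) b[1] / (now - b[0]);
            }
            System.out.println(score);
        }
    }
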
33. MapReduce, Job 1:

    FIFTEEN_MINUTES = 1000 * 60 * 15

    map(key, value) {
      term = getTermFromLogLine(value)
      time = parseTimestampFromLogLine(value)
      bucket = time - (time % FIFTEEN_MINUTES)
      outputKey = (bucket, term)
      outputValue = 1
      emit(outputKey, outputValue)
    }

    reduce(key, values) {
      outputKey = getTermFromKey(key)
      outputValue = (getBucketFromKey(key), sum(values))
      emit(outputKey, outputValue)
    }
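
As one possible concrete rendering, a minimal Java sketch of Job 1's map function. The log format follows slide 11; the tab-separated key encoding and the parsing are assumptions made here, not taken from the companion repo:

    import java.io.IOException;
    import java.time.Instant;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits ((bucket, term), 1) for every log line. The composite key
    // is encoded as "bucket<TAB>term"; a custom WritableComparable
    // would be the more idiomatic choice.
    public class BucketTermMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static final long FIFTEEN_MINUTES = 1000L * 60 * 15;
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed format: "2012-07-01T06:00:02.500Z /search?q=term"
            String[] parts = line.toString().split(" ", 2);
            long time = Instant.parse(parts[0]).toEpochMilli();
            String term = parts[1].substring(parts[1].indexOf("?q=") + 3);

            long bucket = time - (time % FIFTEEN_MINUTES);
            context.write(new Text(bucket + "\t" + term), ONE);
        }
    }
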
34. MapReduce, Job 2:

    map(key, value) {
      // key is a term, value is (bucket, count) from Job 1
      for each prefix of key {
        outputKey = prefix
        count = getCountFromValue(value)
        score = count / (now - getBucketFromValue(value))
        emit(outputKey, (key, score))
      }
    }

    reduce(key, values) {
      for each (term, score) in values {
        storeOrUpdateTermScore(term, score)
      }
      order terms by descending scores
      emit(key, top 10 terms)
    }
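
The “for each prefix” step is what turns full search terms into typeahead keys. A small standalone illustration (the helper is hypothetical, purely for this writeup):

    import java.util.ArrayList;
    import java.util.List;

    public class PrefixExpansion {

        // All prefixes of a term, as enumerated by Job 2's map phase.
        static List<String> prefixes(String term) {
            List<String> result = new ArrayList<>();
            for (int i = 1; i <= term.length(); i++) {
                result.add(term.substring(0, i));
            }
            return result;
        }

        public static void main(String[] args) {
            // Prints: [p, ph, pho, phon, phone]
            System.out.println(prefixes("phone"));
        }
    }
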
35. Processing the logs, a simple solution:
   - Run Job 1 on new input (written by Flume)
   - When finished processing the new logs, move the files to another location
   - Run Job 2 after Job 1, on the new and historic output of Job 1
   - Push the output of Job 2 to a database
   - Clean up historic outputs of Job 1 after N days
36. A database. Sort of... “Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.” http://redis.io/
37. Insert into Redis:
   - Update all keys after each job
   - Set a one-hour TTL on all keys; obsolete entries go away automatically
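
A minimal sketch of such an insert, assuming the Jedis Java client; the host, key layout and value encoding are placeholders, not the talk's actual choices:

    import java.util.Map;

    import redis.clients.jedis.Jedis;

    public class SuggestionLoader {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Hypothetical prefix -> suggestions mapping, as
                // produced by Job 2.
                Map<String, String> suggestions = Map.of(
                        "tri", "trisodium phosphate,tripod,trivia",
                        "tris", "trisodium phosphate");

                for (Map.Entry<String, String> e : suggestions.entrySet()) {
                    // SETEX writes the value and sets a one-hour TTL in
                    // one call, so entries the job stops refreshing
                    // simply expire.
                    jedis.setex(e.getKey(), 3600, e.getValue());
                }
            }
        }
    }
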
38. How much history do we need?
39. Improvements:
   - Use HTTP keep-alive to make sure the TCP connection stays open
   - Create a diff in Hadoop and only update the relevant entries in Redis
   - Production hardening (error handling, packaging, deployment, etc.)
40. Options for better algorithms:
   - Use more signals than only search: did a user type the text or click a suggestion?
   - Look at cause-effect relationships within sessions: was it a successful search? Did the user buy the product?
   - Personalize: filtering of irrelevant results; re-ranking based on personal behavior
41. Q&A. Code is at https://github.com/friso/rug. Contact: fvanvollenhoven@xebia.com, @fzk
