RuG Guest Lecture


Presented as a guest lecture at the Rijksuniversiteit Groningen, as part of the web and cloud computing master course.

I presented an architecture for, and a working implementation of, Hadoop-based typeahead-style search suggestions. There is a companion github repo with the code and config at: https://github.com/friso/rug (there's no documentation, though).

Transcript

  • 1. End-to-end Hadoop for search suggestions. Guest lecture at RuG.
  • 2. WEB AND CLOUD COMPUTING
  • 3. http://dilbert.com/strips/comic/2011-01-07/
  • 4. CLOUD COMPUTING FOR RENT (image: flickr user jurvetson)
  • 5. 30+ million users in two years. With 3 engineers!
  • 6. Also,
  • 7. A quick survey...
  • 8. This talk: build working search suggestions (like on bol.com) that update automatically based on user behavior, using a lot of open source (elasticsearch, Hadoop, Redis, Bootstrap, Flume, node.js, jquery), with a code example, not just slides, and a live demo of the working thing in action.
  • 9. http://twitter.github.com/bootstrap/
  • 10. How do we populate this array?
  • 11. [Architecture diagram] A USER sends an HTTP request: /search?q=trisodium+phosphate (possibly many of these). The server writes it to a log file: 2012-07-01T06:00:02.500Z /search?q=trisodium%20phosphate (possibly terabytes of these). Logs are transported to a compute cluster, which transforms the logs into search suggestions. The suggestions are pushed to a DB server, which serves suggestions for web requests; this thing must be really, really fast.
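For illustration, parsing one such log line in Java could look like the sketch below (the class and helper names are made up for this example, not code from the companion repo):

    import java.net.URLDecoder;
    import java.nio.charset.StandardCharsets;
    import java.time.Instant;

    public class LogLineParser {
        // A log line looks like:
        // "2012-07-01T06:00:02.500Z /search?q=trisodium%20phosphate"

        public static Instant parseTimestamp(String line) {
            // The first whitespace-separated field is an ISO-8601 instant.
            return Instant.parse(line.split(" ")[0]);
        }

        public static String parseTerm(String line) {
            // The search term is the URL-encoded q parameter of the request path.
            String request = line.split(" ")[1];
            String encoded = request.substring(request.indexOf("?q=") + 3);
            return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        }
    }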
  • 12. Search is important! 98% of landing page visitors go directly to the search box.
  • 13. http://www.elasticsearch.org/
  • 14. http://nodejs.org/
  • 15. Node.js: server-side JavaScript, built on top of the V8 JavaScript engine. Completely asynchronous / non-blocking / event-driven.
  • 16. Non-blocking means function calls never block
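The slides demonstrate this in node.js; as an analogy in Java (used for all code sketches here), a non-blocking call with the JDK 11 HttpClient looks like this:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class NonBlocking {
        public static void main(String[] args) throws InterruptedException {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request =
                HttpRequest.newBuilder(URI.create("http://example.com/")).build();

            // sendAsync returns immediately; the callback fires when the response arrives.
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                  .thenAccept(response -> System.out.println("got " + response.statusCode()));

            // This prints before the response is in: the call above did not block.
            System.out.println("still running");
            Thread.sleep(2000); // keep the JVM alive long enough for the demo
        }
    }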
  • 17. Distributed filesystem. Consists of a single master node and multiple (many) data nodes. Files are split up into blocks (64MB by default). Blocks are spread across data nodes in the cluster. Each block is replicated multiple times to different data nodes in the cluster (typically 3 times). The master node keeps track of which blocks belong to a file.
  • 18. Accessible through a Java API. FUSE (filesystem in user space) driver available to mount as a regular FS. C API available. Basic command line tools in the Hadoop distribution. Web interface.
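A minimal sketch of the Java API route (the path and content are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up the default filesystem from the Hadoop config on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());

            // Metadata goes through the master node; data goes to the data nodes.
            try (FSDataOutputStream out = fs.create(new Path("/some/file"))) {
                out.writeUTF("hello HDFS");
            }

            try (FSDataInputStream in = fs.open(new Path("/some/file"))) {
                System.out.println(in.readUTF());
            }
        }
    }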
  • 19. File creation, directory listing and other metadata actions go through the master node (e.g. ls, du, fsck, create file). Data goes directly to and from data nodes (read, write, append). Local read path optimization: clients located on the same machine as a data node will always access the local replica when possible.
  • 20. [Diagram: the Name Node tracks file metadata (/some/file, /foo/bar). HDFS clients go through it to create files and locate data, then write data to and read data from the Data Nodes (each with its own disks) directly. Data Nodes replicate blocks among themselves; a node-local HDFS client reads from its local replica.]
  • 21. NameNode daemon: the filesystem master node. Keeps track of directories, files and block locations. Assigns blocks to data nodes. Keeps track of live nodes (through heartbeats). Initiates re-replication in case of data node loss. Block metadata is held in memory; it will run out of memory when too many files exist.
  • 22. DataNode daemon: filesystem worker node / "block server". Uses an underlying regular FS for storage (e.g. ext4). Takes care of distribution of blocks across disks (don't use RAID; more disks means more IO throughput). Sends heartbeats to the NameNode. Reports its blocks to the NameNode (on startup). Does not know about the rest of the cluster (shared nothing).
  • 23. HDFS trivia: HDFS is write once, read many, but has append support in newer versions. Has built-in compression at the block level. Does end-to-end checksumming on all data (on read and periodically). HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations. Optimized for sequential reads, not random access.
  • 24. Job input comes from files on HDFS. Other formats are possible, but require a specialized InputFormat implementation. Built-in support for text files (convenient for logs, csv, etc.). Files must be splittable for parallelization to work; not all compression formats have this property (e.g. gzip).
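Wiring that up in a job driver takes only a few lines; a sketch (input and output paths are made up, and in a real job the mapper and reducer classes would be set as well):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "search term counts");
            job.setJarByClass(JobDriver.class);

            // TextInputFormat is the built-in line-oriented format (logs, csv, etc.);
            // it hands each line to the mapper as a (byte offset, line) pair.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/logs/incoming"));
            FileOutputFormat.setOutputPath(job, new Path("/logs/counted"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }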
  • 25. JobTracker daemon: the MapReduce master node. Takes care of scheduling and job submission. Splits jobs into tasks (Mappers and Reducers). Assigns tasks to worker nodes. Reassigns tasks in case of failure. Keeps track of job progress. Keeps track of worker nodes through heartbeats.
  • 26. TaskTracker daemon: MapReduce worker process. Starts Mappers and Reducers assigned by the JobTracker. Sends heartbeats to the JobTracker. Sends task progress to the JobTracker. Does not know about the rest of the cluster (shared nothing).
  • 27. [Diagram: one master node running the NameNode and JobTracker; four worker nodes, each running a DataNode and a TaskTracker.]
  • 28. “Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” http://flume.apache.org/
  • 29-31. [Flume architecture diagrams; images from http://flume.apache.org]
  • 32. Processing the logs, a simple approach: divide time into 15-minute buckets; count the number of times a search term appears in each bucket; divide the counts by (current time - time of start of bucket); rank terms according to the sum of the divided counts.
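As a small worked example of that scoring scheme (all numbers invented for illustration):

    public class ScoreExample {
        static final long FIFTEEN_MINUTES = 1000L * 60 * 15;

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            long time = now - 40L * 60 * 1000;             // a search logged 40 minutes ago
            long bucket = time - (time % FIFTEEN_MINUTES); // start of its 15-minute bucket
            long count = 12;                               // times the term appeared in that bucket

            // Dividing by the bucket's age makes recent activity outweigh old activity.
            double score = (double) count / (now - bucket);
            System.out.println("bucket=" + bucket + ", score=" + score);
        }
    }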
  • 33. MapReduce, Job 1:

    FIFTEEN_MINUTES = 1000 * 60 * 15

    map(key, value) {
      term = getTermFromLogLine(value)
      time = parseTimestampFromLogLine(value)
      bucket = time - (time % FIFTEEN_MINUTES)
      outputKey = (bucket, term)
      outputValue = 1
      emit(outputKey, outputValue)
    }

    reduce(key, values) {
      outputKey = getTermFromKey(key)
      outputValue = (getBucketFromKey(key), sum(values))
      emit(outputKey, outputValue)
    }
  • 34. MapReduce, Job 2:

    map(key, value) {
      for each prefix of key {
        outputKey = prefix
        count = getCountFromValue(value)
        score = count / (now - getBucketFromValue(value))
        emit(outputKey, (key, score))
      }
    }

    reduce(key, values) {
      for each (term, score) in values {
        storeOrUpdateTermScore(term, score)
      }
      order terms by descending score
      emit(key, top 10 terms)
    }
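In the real Hadoop Java API, Job 1 could be sketched as below. This is only an illustration, not the code from the companion repo; the composite (bucket, term) key is encoded as tab-separated Text for brevity, where a custom WritableComparable would be cleaner:

    import java.io.IOException;
    import java.net.URLDecoder;
    import java.nio.charset.StandardCharsets;
    import java.time.Instant;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TermBucketCount {
        static final long FIFTEEN_MINUTES = 1000L * 60 * 15;

        public static class BucketMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Log line: "<ISO-8601 timestamp> /search?q=<url-encoded term>"
                String[] fields = line.toString().split(" ");
                long time = Instant.parse(fields[0]).toEpochMilli();
                long bucket = time - (time % FIFTEEN_MINUTES);
                String encoded = fields[1].substring(fields[1].indexOf("?q=") + 3);
                String term = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
                ctx.write(new Text(bucket + "\t" + term), ONE);
            }
        }

        public static class SumReducer extends Reducer<Text, LongWritable, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) sum += v.get();
                // Re-key by term so Job 2 receives (term, (bucket, count)) pairs.
                String[] parts = key.toString().split("\t");
                ctx.write(new Text(parts[1]), new Text(parts[0] + "\t" + sum));
            }
        }
    }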
  • 35. Processing the logs, a simple solution: run Job 1 on new input (written by Flume); when finished processing the new logs, move the files to another location; run Job 2 after Job 1, on new and historic output of Job 1; push the output of Job 2 to a database; clean up historic outputs of Job 1 after N days.
  • 36. A database. Sort of... "Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets." http://redis.io/
  • 37. Insert into Redis: update all keys after each job. Set a TTL of one hour on all keys; obsolete entries go away automatically.
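With the Jedis client, that could look like the sketch below; the host, key naming scheme, and value layout are assumptions for the example, not from the slides:

    import redis.clients.jedis.Jedis;

    public class SuggestionPusher {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost")) {
                // One key per prefix, holding the ranked suggestions. SETEX writes the
                // value and its TTL in one step; after 3600 seconds without a refresh,
                // the key simply expires and the obsolete entry disappears.
                jedis.setex("suggest:triso", 3600, "trisodium phosphate");
            }
        }
    }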
  • 38. How much history do we need?
  • 39. Improvements: use HTTP keep-alive to make sure the TCP connection stays open. Create a diff in Hadoop and only update the relevant entries in Redis. Production hardening (error handling, packaging, deployment, etc.).
  • 40. Options for better algorithms: Use more signals than only search: did the user type the text or click a suggestion? Look at cause-effect relationships within sessions: was it a successful search? Did the user buy the product? Personalize: filtering of irrelevant results, re-ranking based on personal behavior.
  • 41. Q&A. Code is at https://github.com/friso/rug. fvanvollenhoven@xebia.com / @fzk
