5. What is Apache Hadoop?
- A cluster technology with a single master and multiple slaves, designed for commodity hardware.
- It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce.
- As data is copied onto HDFS, it is split into blocks and replicated to other machines (nodes) to provide redundancy.
- Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on the machines in the cluster, each processing the data on its local machine (data locality).
- Hadoop may execute or re-execute a job on any node in the cluster. Node failures are handled automatically by the framework.
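As a concrete illustration (not from the deck), a self-contained streaming job can be just two small scripts that read stdin and write stdout; Hadoop ships them to every node and runs them against the local blocks. A minimal word-count sketch in Ruby, with hypothetical file names:

  # wc_mapper.rb -- emit "word <tab> 1" for every word seen on stdin
  STDIN.each_line do |line|
    line.split.each { |word| puts "#{word}\t1" }
  end

  # wc_reducer.rb -- streaming input arrives sorted by key, so we can sum runs
  current, count = nil, 0
  STDIN.each_line do |line|
    word, n = line.chomp.split("\t")
    if word != current
      puts "#{current}\t#{count}" if current
      current, count = word, 0
    end
    count += n.to_i
  end
  puts "#{current}\t#{count}" if current

Submitted via the Hadoop Streaming jar (the jar's path varies by distribution):

  hadoop jar hadoop-streaming.jar -input /logs -output /counts \
    -mapper wc_mapper.rb -reducer wc_reducer.rb \
    -file wc_mapper.rb -file wc_reducer.rb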
6. The Big Data Ecosystem (diagram)
- Provisioning: ClusterChef / Apache Whirr / EC2
- Core: Hadoop, on commodity hardware
- Scripting: Pig / WuKong / Cascading
- Online systems (OLTP @ scale, NoSQL): Cassandra / HBase
- Offline systems (analytics), DBA: Hive / Karmasphere
- Import/export tooling: Nutch / SQOOP / Flume
- Visualizations, human consumption (non-programmer): BigSheets / DataMeer
26. Hadoop as haiku:
Data flutters by (label)
Elephants make sturdy piles ({GROUP})
Number becomes thought (process_group)
27. Twitter Parser in a Tweet
class TwStP < Streamer
  def process(line)
    a = JSON.load(line) rescue {}
    yield a.values_at(*a.keys.sort)
  end
end
Wukong.run(TwStP)
35. Thanks for coming!
Stu Hood @stuhood
Flip Kromer @mrflip
Matt Pfeil @mattz62
Eric Sammer @esammer
Steve Watt @wattsteve
Editor's Notes
What about Cascading?
- What did a user buy? We don’t want to recommend products they’ve already purchased.
- What products did a user browse, hover over, rate, or add to a cart? These all provide additional signals we can use in making meaningful recommendations. Obviously you wouldn’t want to recommend a product a user has rated poorly (a poor rating may also indicate they own it but bought it from someone else). We can sometimes derive implicit information like this from explicit actions: adding an item to a cart but not purchasing it indicates a strong “likely purchase.”
- User attributes tell us more about which products are appropriate to recommend. Maybe we should recommend products within a user’s expected budget to increase the likelihood of purchase. Also, we don’t want to offer deep discounts to users who can (or historically have) paid full price.
- Knowing what our margins on a product are can also impact our decisions.
- Inventory is also interesting: we don’t want to recommend something that’s not in stock!
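One hedged way to sketch this reasoning (action names and weights below are illustrative, not from the talk): fold a user’s events on one product into a single score, vetoing anything already owned or rated poorly.

  # Illustrative per-event weights; real values would come from experimentation
  WEIGHTS = {
    "cart_add"  => 0.8,   # strong "likely purchase" signal
    "rate_high" => 0.6,
    "browse"    => 0.2,
    "rate_low"  => -1.0,  # a poor rating effectively vetoes the product
  }

  # events: the list of actions this user took on one product
  def recommendable?(events)
    return false if events.include?("purchase")   # never re-recommend owned items
    events.sum { |e| WEIGHTS.fetch(e, 0.0) } > 0.5
  end

  recommendable?(["browse", "cart_add"])   # => true
  recommendable?(["browse", "purchase"])   # => false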
If we can read data sequentially at ~250 MB/s (SATA II: 3 Gb/s minus some overhead), we’re talking about 1200 GB * 1024 (convert to MB) / 250 MB/s = ~4915 seconds / 60 = ~82 minutes just to read the activity logs. This doesn’t account for growth. Distilling the data (i.e. grouping users into interested categories and then picking interesting products in each category) drastically reduces accuracy; more data helps. Could we provide better recommendations by looking a year back? What if we looked at people who browsed or bought a significant number of products in common with you and based recommendations on that? What if we could take social data like Facebook and Twitter into account and recommend products based on friends? Repeating this operation to keep the data up to date could get expensive quickly.
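Restating the back-of-envelope math as a tiny script (the 250 MB/s throughput is the note’s assumption):

  gb_of_logs = 1200
  mb_per_sec = 250
  seconds    = gb_of_logs * 1024 / mb_per_sec   # => 4915
  minutes    = seconds / 60.0                   # => ~81.9
  puts "~#{minutes.round} minutes just to scan the logs once"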
A batch operation that calculates recommendations daily probably makes sense. For each user, store a list of product IDs and present them pseudo-randomly or when the user views a product in the same category. Hadoop Map/Reduce lets you parallelize this type of operation so it can run on hundreds or thousands of machines.
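A hedged sketch of that daily batch as a streaming pair; the log format (tab-separated user_id, action, product_id) is assumed for illustration:

  # rec_mapper.rb -- keep only events that signal interest, keyed by user
  STDIN.each_line do |line|
    user, action, product = line.chomp.split("\t")
    puts "#{user}\t#{product}" if %w[cart_add browse rate_high].include?(action)
  end

  # rec_reducer.rb -- input arrives sorted by user; collect each user's candidates
  current, products = nil, []
  STDIN.each_line do |line|
    user, product = line.chomp.split("\t")
    if user != current
      puts "#{current}\t#{products.uniq.join(',')}" if current
      current, products = user, []
    end
    products << product
  end
  puts "#{current}\t#{products.uniq.join(',')}" if current

The reducer’s output, user_id mapped to a list of product IDs, is exactly the table you’d export back to the serving store.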
Walk through the basic architecture for a recommendation engine running on Hadoop, with RDBMS import/export and app servers logging to HDFS. Mention the RDBMS could be HBase, Cassandra, or another “big data” store. Sqoop handles RDBMS/Hadoop import and export; Flume and similar systems can facilitate reliable logging to HDFS from various sources.
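For example, the import/export legs might look like this with Sqoop (the connection string, table, and directory names are placeholders):

  sqoop import --connect jdbc:mysql://db.example.com/shop \
    --table products --target-dir /warehouse/products

  sqoop export --connect jdbc:mysql://db.example.com/shop \
    --table recommendations --export-dir /output/recommendations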
- From the Big Data leaders: Google, Amazon, Facebook
Scalability = the ability to add machines for easy horizontal growth
Redundancy = data is replicated
High performance = fast
Cost = ~$300K
So outside Infochimps, the internet is this wonderful smorgasbord of facts: with Google, Wikipedia, and the rest you can find something about anything you seek.
But we’re ready to move to a world where you can find heartier fare -- to find out everything about something.
Program according to this haiku and Hadoop takes care of the janitorial work.
Here’s a working Twitter-stream JSON-to-TSV parser that fits in a tweet. Run it with:
cat stream.json | ruby -r'wukong/script' -rjson parser_in_a_tweet.rb --map
Both the class and the command fit comfortably in a tweet.
what’s left is pure functionality,
which to a programmer is pure fun.
Now I know this sounds like the lunacy of a Ritalin-addled architecture astronaut spending too much time on StackOverflow.