http://isnatesilverawitch.com/

Everyone predicted the election correctly: the RCP model got every state but Florida, the PEC said Florida was a tossup, and 538 got every single state right.
Markos Moulitsas over at the Daily Kos did even better than Nate at predicting the share of the vote within the swing states. Don’t think that math can always out-perform an expert armed with good data.

http://news.cnet.com/8301-13578_3-57546778-38/among-the-top-election-quants-nate-silver-reigns-supreme/
Index fund == simple average
Hedge fund == 538
Warren Buffett == expert with good data
Classical data economics: if the value I can extract from a byte is less than the cost to store it, then I throw it away or relegate it to tape.
We use metaphors that help us understand new technology in terms of the old. We translate desktop tools and metaphors onto Hadoop, even when we’re working with specialized data types: http://blog.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/
It’s a data warehousing metaphor, not an actual data warehouse. Schema on read vs. schema on write, for example. Non-interactive for the most part: think ELT, not interactive queries.
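The schema-on-read distinction can be sketched in a few lines. This is a hypothetical illustration, not actual Hive or Hadoop code; the log format and function names are invented:

```python
# Toy illustration of schema-on-write vs. schema-on-read.
# The data format and all names here are made up for the example.
import csv
import io

RAW = "2012-11-06,ohio,obama\n2012-11-06,florida,obama\n"

def load_with_schema(raw):
    """Schema on write: validate and structure the data BEFORE storing it."""
    rows = []
    for date, state, winner in csv.reader(io.StringIO(raw)):
        rows.append({"date": date, "state": state, "winner": winner})
    return rows  # queries later assume this fixed structure

def query_raw(raw, state):
    """Schema on read: store raw bytes as-is; apply a schema at query time."""
    for line in raw.splitlines():
        date, st, winner = line.split(",")  # interpretation happens here
        if st == state:
            yield winner

print(list(query_raw(RAW, "ohio")))
```

The point of the second style is that the raw data never has to be rewritten when your interpretation of it changes.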
We borrow these abstractions because they make it easy to get started, but they don’t necessarily conform to the user’s expectations of how Hadoop will work. If you think of Hadoop as a really big database, or as a spreadsheet that goes on forever and ever, then you have failed to understand Hadoop.
Impala is about fulfilling those abstractions, esp. for interactive queries of relational-style data on Hadoop.
But we can also go beyond the abstractions and study how Hadoop can be effective for new kinds of analytic applications.
Step 1: Study real problems. Especially real problems where non-sophisticated users (e.g., people who don’t even know SQL) need to do sophisticated analysis on large quantities of information.
I realized earlier this year that other people do not use Hive the way that I use Hive, and so we created the data science course to take people through the problem of building an analytical application from start to finish on Hadoop.

http://blog.cloudera.com/blog/2012/10/data-science-training/
Analytical applications are applications that allow users to work with and make decisions from data.
Big Data Economics
• No individual record is particularly valuable
• Having every record is incredibly valuable
• Examples: web index, recommendation systems, sensor data, market basket analysis, online advertising
The Hadoop Distributed File System
• Based on the Google File System
• Data stored in large files
• Large block size: 64MB to 256MB per block
• Blocks are replicated to multiple nodes in the cluster
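The block arithmetic above is easy to sketch. A hedged illustration in Python (128MB blocks and 3x replication are common HDFS defaults, but both are configurable):

```python
# Hypothetical sketch of how HDFS splits a file into replicated blocks.
import math

def hdfs_blocks(file_size_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Return (number of blocks, total block replicas) for a file."""
    blocks = math.ceil(file_size_bytes / block_size)
    return blocks, blocks * replication

# A 1 GB file with 128 MB blocks and 3x replication:
blocks, replicas = hdfs_blocks(1024 ** 3)
print(blocks, replicas)  # 8 blocks, 24 replicas spread across the cluster
```

The large block size is what makes replication and sequential scans cheap relative to managing millions of tiny files.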
Simple, Reliable Processing: MapReduce
• Map stage: embarrassingly parallel
• Shuffle stage: large-scale distributed sort
• Reduce stage: process all of the values that have the same key in a single step
• Process the data where it is stored
• Write once and you’re done.
Developing Analytical Applications with Hadoop
The Best Way to Get Started: Apache Hive
• Data warehouse system on top of Hadoop
• SQL-based query language: SELECT, INSERT, CREATE TABLE
• Includes some MapReduce-specific extensions
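A minimal sketch of what that SQL looks like in Hive. The table and column names are invented for illustration; only the statement forms (CREATE TABLE, SELECT) are Hive’s:

```sql
-- Hypothetical Hive session; "votes" and its columns are made up.
CREATE TABLE votes (state STRING, candidate STRING, total INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Familiar SQL that Hive compiles into MapReduce jobs under the hood.
SELECT state, candidate, SUM(total) AS votes
FROM votes
GROUP BY state, candidate;
```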
A Couple of Themes
1. Structure the data in the way that makes sense for the problem.
2. Interactive inputs, not just interactive outputs.
3. Simpler interfaces that yield more sophisticated answers.