Key Drivers
Spread of cloud computing, mobile computing, and social media technologies; financial transactions
Sources of Big Data
• Chatter from social networks
• Web server logs
• Traffic flow sensors
• Satellite imagery
• Broadcast audio streams
• Banking transactions
• MP3s of rock music
• The content of web pages
• Scans of government documents
• GPS trails
• Telemetry from automobiles
• Financial market data
• …
Process Pipeline
Source: http://radar.oreilly.com
Hadoop
A distributed processing framework based on Map/Reduce.
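The Map/Reduce model behind Hadoop can be illustrated with a minimal, single-process sketch. This is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the example assumes a simple word-count job running in memory rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data big tools", "data streams"])
print(counts)
```

In a distributed setting, the map and reduce calls would run on different machines and the shuffle would move data over the network; the sketch only shows the shape of the computation.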
Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Mahout
A machine learning library with algorithms for clustering, classification, and batch-based collaborative filtering, implemented on top of Apache Hadoop.
Hive
Data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage.
Pegasus
A peta-scale graph mining system that runs in a parallel, distributed manner on top of Hadoop.
Sqoop
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
Flume
A distributed service for collecting, aggregating, and moving large amounts of log data to HDFS.
Yahoo S4
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
Twitter Storm
Storm can be used to process a stream of new data and update databases in real time.
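The idea of processing a stream and updating a store as each item arrives can be sketched as follows. This is not Storm's or S4's API: `process_stream` is a hypothetical helper, and a `Counter` stands in for the database that a real deployment would update.

```python
from collections import Counter


def process_stream(events, store):
    """Consume events one at a time and update the store after each one."""
    for event in events:
        store[event] += 1          # incremental update, no batch recomputation
        yield event, store[event]  # expose the fresh running count immediately


store = Counter()  # stand-in for a database table of running counts
events = iter(["click", "view", "click"])  # stand-in for an unbounded source
updates = list(process_stream(events, store))
print(updates)
```

Because `process_stream` is a generator, it never needs the whole stream in memory, which is the property that matters for unbounded input.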
Trends
Funding, Companies, Applications, Jobs, IPOs
Funding & IPO
• Cloudera (commercial Hadoop): more than $75 million
• MapR (Cloudera competitor): more than $25 million raised
• 10Gen (maker of MongoDB): $32 million
• DataStax (products based on Apache Cassandra): $11 million
• Splunk: about $230 million raised through its IPO
Big Data Application Domains
• Healthcare
• The public sector
• Retail
• Manufacturing
• Personal-location data
• Finance
Future of Big Data
• More powerful and expressive tools for analysis
• Streaming data processing (Storm from Twitter and S4 from Yahoo)
• Rise of data marketplaces (InfoChimps, Azure Marketplace)
• Development of data science workflows and tools (Chorus, The Guardian, New York Times)
• Increased understanding of analysis and visualization
Source: http://www.evolven.com/blog/big-data-predictions.html