Big Data – A Brief Overview
Petabytes, Hadoop, Analytics,
Collaborative business intelligence,
Data scientists, In-Memory Databases,
NoSQL platforms
Big Data
• What is it?
• Where does it come from?
• How do we process it?
• What do we do with it?
• Who are the players?
• What are the opportunities?
What Is Big Data?
Like the term "cloud", it is a bit
nebulous
Key Drivers
Spread of cloud computing, mobile
computing, social media technologies,
and electronic financial transactions
Sources of Big Data
• Chatter from social networks
• Web server logs
• Traffic flow sensors
• Satellite imagery
• Broadcast audio streams
• Banking transactions
• MP3s of rock music
• The content of web pages
• Scans of government documents
• GPS trails
• Telemetry from automobiles
• Financial market data
• …
Pig
A platform for analyzing large data sets that
consists of a high-level language for
expressing data analysis programs, coupled
with infrastructure for evaluating these
programs.
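To make the idea of a data analysis program expressed as a chain of high-level steps more concrete, here is a minimal sketch in plain Python. It is not Pig Latin and does not use Pig; it only imitates the load, filter, group and count style of dataflow. The log format, the sample records and all function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# Hypothetical web server log records: "ip method path status"
SAMPLE_LOG = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing 404",
    "10.0.0.1 GET /about.html 200",
    "10.0.0.3 POST /login 500",
    "10.0.0.2 GET /missing 404",
]


def load(lines: Iterable[str]) -> Iterator[dict]:
    """Load step: parse each raw line into a record."""
    for line in lines:
        ip, method, path, status = line.split()
        yield {"ip": ip, "method": method, "path": path, "status": int(status)}


def filter_errors(records: Iterable[dict]) -> Iterator[dict]:
    """Filter step: keep only requests that returned an error status."""
    return (r for r in records if r["status"] >= 400)


def count_by_path(records: Iterable[dict]) -> Counter:
    """Group and count step: number of records per requested path."""
    return Counter(r["path"] for r in records)


def error_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the steps into one dataflow, evaluated lazily until the count."""
    return dict(count_by_path(filter_errors(load(lines))))


if __name__ == "__main__":
    print(error_counts(SAMPLE_LOG))
```

On a real cluster the same sequence of steps would be compiled into distributed jobs rather than run in a single process.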
Mahout
A machine learning library with algorithms
for clustering, classification and batch
based collaborative filtering that are
implemented on top of Apache Hadoop.
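As a rough illustration of batch-based collaborative filtering, the sketch below builds an item co-occurrence table in one pass over all users and then scores unseen items for a single user. It is plain Python, not Mahout's API, and co-occurrence counting is only one simple variant of the technique. The preference data and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List, Set

# Hypothetical preference data: user -> set of items the user liked
PREFS: Dict[str, Set[str]] = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B"},
    "carol": {"B", "C", "D"},
}


def cooccurrence(prefs: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Batch step: count how often each ordered pair of items is liked together."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts


def recommend(user: str, prefs: Dict[str, Set[str]], top_n: int = 3) -> List[str]:
    """Score unseen items by their co-occurrence with the user's liked items."""
    counts = cooccurrence(prefs)
    seen = prefs[user]
    scores: Dict[str, int] = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


if __name__ == "__main__":
    print(recommend("bob", PREFS))
```

In a Hadoop setting the co-occurrence pass is the part that would be distributed, since it touches every user's history.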
Hive
Data warehouse software built on top of
Apache Hadoop that facilitates querying
and managing large datasets residing in
distributed storage.
Sqoop
A tool designed for efficiently transferring
bulk data between Apache Hadoop and
structured data stores such as relational
databases.
Flume
A distributed service for
collecting, aggregating, and moving large
amounts of log data to HDFS.
Yahoo S4
S4 is a general-purpose, distributed, scalable,
partially fault-tolerant, pluggable platform that
allows programmers to easily develop
applications for processing continuous
unbounded streams of data.
Twitter Storm
Storm can be used to process a
stream of new data and update
databases in real time.
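The stream processing model can be sketched in a few lines of plain Python: records are handled one at a time as they arrive, and a store is updated after each one instead of after a batch job. This does not use Storm's API. The message source, the dictionary standing in for a database, and all function names are assumptions, and the stream is finite here only so the example terminates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def message_stream() -> Iterator[str]:
    """Hypothetical stand-in for an unbounded source of messages."""
    yield from ["big data is big", "storm processes streams", "big streams"]


def split_words(messages: Iterable[str]) -> Iterator[str]:
    """First processing stage: emit one word at a time."""
    for message in messages:
        for word in message.split():
            yield word


def update_counts(
    words: Iterable[str], store: Dict[str, int]
) -> Iterator[Tuple[str, int]]:
    """Second stage: update the store per word and emit the new running count."""
    for word in words:
        store[word] += 1
        yield word, store[word]


def process(messages: Iterable[str]) -> Dict[str, int]:
    """Wire the stages together and drain the stream."""
    store: Dict[str, int] = defaultdict(int)  # stands in for a database
    for _word, _count in update_counts(split_words(messages), store):
        pass  # the store is already up to date after every record
    return dict(store)


if __name__ == "__main__":
    print(process(message_stream()))
```

The point of the design is that the store is current after every record, which is what distinguishes this model from a batch job over the same data.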
Funding & IPO
• Cloudera (commercial Hadoop) has raised more than
$75 million
• MapR (Cloudera competitor) has raised more
than $25 million
• 10gen (maker of MongoDB) has raised $32 million
• DataStax (products based on Apache
Cassandra) has raised $11 million
• Splunk raised about $230 million through its IPO
Big Data Application Domains
• Healthcare
• The public sector
• Retail
• Manufacturing
• Personal-location data
• Finance
Future of Big Data
• More Powerful and Expressive Tools for Analysis
• Streaming Data Processing (Storm from Twitter and S4 from
Yahoo)
• Rise of Data Marketplaces (InfoChimps, Azure
Marketplace)
• Development of Data Science Workflows and Tools
(Chorus, The Guardian, New York Times)
• Increased Understanding of Analysis and Visualization
http://www.evolven.com/blog/big-data-predictions.html