Big Data: Technologies & Challenges Facing Business Today
A Practical Guide to Getting Started
Hello, My Name is Sara Robertson
About Me
• I’m the VP of Technology at
CPX Interactive.
• Previously ran the platform
team at Warner Music Group.
• I choose technology based on
cost, efficiency, availability of
talent, valuation potential, long
term outlook, community
support, and other reasons
besides just pure tech.
• I love agile, open source,
bio-hacking, anime and martial
arts movies, emo, audiobooks.
• Favorite tech skills:
optimization, debugging.
About Me + Data
• Started at a mainframe
company doing Nagios-style
server & network monitoring.
• Spent my early years
obsessed with Oracle, then
Postgres, then MySQL.
• Most of my data experience is
in either high-traffic web
applications or back-office data
warehousing.
• I believe that data is about
humans as much as it is about
technology, and if your solution
doesn’t speak to your users
then it’s not really a solution.
8/14/2014
Big Data Landscape
Big Data We’ll Talk About Today
Four Problems With Data Problems
Too Big Too Fast Too Disparate Too Unwieldy
system, creative_id, cpx_creative_name, viq_creative_name, alt_creative_id,
alt_creative_name, size, file_size, click_track, audit_status, type,
li_number, li_name, camp_number, camp_name, viq_placement_id, x_cslookup
hourlycpxDataSiphonHour.py
auction_id_64                bigint
date_time                    string
user_tz_offset               tinyint
width                        smallint
height                       smallint
media_type                   tinyint
fold_position                tinyint
event_type                   string
imp_type                     tinyint
payment_type                 tinyint
media_cost_dollars_cpm       float
revenue_type                 tinyint
buyer_spend                  float
buyer_bid                    float
ecp                          float
eap                          float
is_imp                       int
is_learn                     tinyint
predict_type_rev             tinyint
othuser_id_64                bigint
ip_address                   string
ip_address_trunc             string
geo_country                  string
geo_region                   string
operating_system             tinyint
browser                      tinyint
language                     tinyint
venue_id                     int
seller_member_id             int
publisher_id                 int
site_id                      int
site_domain                  string
tag_id                       int
external_inv_id              int
reserve_price                float
seller_revenue_cpm           float
media_buy_rev_share_pct      float
pub_rule_id                  int
seller_currency              string
publisher_currency           string
publisher_exchange_rate      float
serving_fees_cpm             float
serving_fees_revshare        float
buyer_member_id              int
advertiser_id                int
brand_id                     int
advertiser_frequency         int
advertiser_recency           int
insertion_order_id           int
campaign_group_id            int
campaign_id                  int
creative_id                  int
creative_freq                int
creative_rec                 int
cadence_modifier             float
can_convert                  tinyint
user_group_id                int
is_control                   tinyint
control_pct                  tinyint
control_creative_id          int
is_click                     int
pixel_id                     int
is_remarketing               tinyint
post_click_conv              int
post_view_conv               int
post_click_revenue           float
post_view_revenue            float
order_id                     string
external_data                string
pricing_type                 string
booked_revenue_dollars       float
booked_revenue_adv_curr      float
commission_cpm               float
commission_revshare          float
auction_service_deduction    float
auction_service_fees         float
creative_overage_fees        float
clear_fees                   float
buyer_currency               string
advertiser_currency          string
advertiser_exchange_rate     float
x_datasiphon
Google’s qpms in 2000
Google’s qpms in 2011
Old Table Layouts
New Table Layouts
Old File Formats
New File Formats
Old Pattern of Change in Data & Business Processes (Waterfall!)
New Pattern of Change, Agile!
Data Problems in Advertising Examples
Our statistics at CPX…
• 5.5+ billion impressions per day
• 20+ billion bids per day
• 45+ billion segments per day
• 80+ columns per data stream
• Columns average 25+ bytes
That’s more than 100
Terabytes and 75 Billion
records every day! *
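The headline number can be checked from the bullets above. This is a back-of-envelope sketch only: it treats each "+" figure as a lower bound and assumes every record carries the full 80 columns at 25 bytes each. The variable names are illustrative and do not come from the original deck.

```python
# Back-of-envelope check of the daily volume, using the slide's lower bounds.
impressions_per_day = 5_500_000_000
bids_per_day = 20_000_000_000
segments_per_day = 45_000_000_000

columns_per_record = 80   # "80+ columns per data stream"
bytes_per_column = 25     # "Columns average 25+ bytes"

records_per_day = impressions_per_day + bids_per_day + segments_per_day
bytes_per_day = records_per_day * columns_per_record * bytes_per_column
terabytes_per_day = bytes_per_day / 10**12  # decimal terabytes

print(f"{records_per_day / 10**9:.1f} billion records per day")
print(f"{terabytes_per_day:.0f} TB per day")
```

Since every input is a floor, the result is consistent with the slide's "more than 100 Terabytes" claim.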
Life of a Single Ad Impression
Website
Ad Server
• Access Logs
• Cookies
• Interactions…
• Predictions
• Demographics
• Prices…
• Potential Buyers
• Market Trends
• Preferences…
Exchanges
Bidders
• Bid Parameters
• Wins/Losses
• Ceilings/Floors…
• Creatives
• Targeting
• Analytics…
• Bid Attempts
• Imp Value
• Market Demand
• Revenue
• Costs
• Performance…
• Profit
• Winners
• Demographics
Data Problems in Music Examples
Our statistics at Warner…
• 5+ million radio plays daily
• 10+ million digital tracks sold
daily
• $1+ million in ecommerce daily
• 20+ million fans online
• 10-20 channels of interaction
with every fan
• Thousands of feeds of data that
could potentially mention a band
An essentially unlimited
supply of new data streams
with ever-changing data formats!
1:07:01pm Radio Plays in Seattle:
MBUBLE HAVENT MET YOU YE
BUBLÉ, MICHAEL
HAVEN’TMETYOUYET
1:07:02pm On Twitter in Cyberspace:
OMG this song is so sick! <3 #mbuble
#haventmet
This met you yet Bubblé song makes me
sick.
1:07:03pm On Website from China:
12 visitors to the website.
1:07:00pm On TV in New York:
Michael Bublé appears on Oprah.
• First match up the many different
versions of the artist’s name
• Then Analyze sentiment to tell the
difference between uses of “sick”
• Then Augment sparse data streams
with useful dimensions (time,
location)
• Then decide how to correlate data!
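The first step above, matching up the many versions of an artist's name, might be sketched as follows. This is a minimal illustration and not the approach used at Warner: the `normalize` and `match_artist` helpers and the `KNOWN_ARTISTS` alias table are hypothetical, and a real system would need fuzzy matching for truncated or misspelled feeds.

```python
import re
import unicodedata
from typing import Optional


def normalize(raw: str) -> str:
    """Lowercase, strip accents, and drop everything except letters and digits."""
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]", "", no_accents.lower())


# Hypothetical alias table: normalized variant -> canonical artist name.
KNOWN_ARTISTS = {
    "mbuble": "Michael Bublé",
    "bublemichael": "Michael Bublé",
    "michaelbuble": "Michael Bublé",
}


def match_artist(raw: str) -> Optional[str]:
    """Return the canonical artist name, or None when the variant is unknown."""
    return KNOWN_ARTISTS.get(normalize(raw))


for variant in ["BUBLÉ, MICHAEL", "#mbuble", "Michael Bublé"]:
    print(variant, "->", match_artist(variant))
```

An exact-match table like this only covers variants already seen; sentiment analysis and correlation are separate problems it does not touch.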
Database Types
Unstructured to Structured:
• Document Store, Distributed File System: Used for unstructured fast-flowing data
• Key-Value Pair: Used for semi-structured fast-flowing data
• Massively Parallel (MPPs): Used for structured high-volume, high read + write data
• Master/Slave: Used for heavy read apps with normal data volume & scale requirements
• In-Memory Columnar: Used for super high speed read-only access to cacheable data
Document stores export into
Parallel and Master databases,
which cache into Columnar
databases.
What we tried in Music Case Study
Roll-Your-Own approach
• Python + RabbitMQ +
MongoDB + PHP for
custom BI layer
• Custom development of
workflow, transformation,
storage, correlation,
smoothing, analysis
• Custom dev of dashboards,
reports, charts, etc for the
business
Why it didn’t work
• Bleeding edge technologies
were too immature and cost
of talent was too high
• Outsourced dev + insourced
support = fail
• Too much overhead to get a
usable product
What we’re doing in Advertising Case Study
Use-A-Stack approach
• Leverage a kick-start with
a stack that reduces
implementation time,
learning curve, and talent
costs
• Write pluggable modules
• Build the plan for
multi-layered data storage
from the beginning
Why it’s working
• By keeping our
investment and footprint
light, we’re able to
respond quickly to
changes in the industry &
technology ecosystem
• The multiple layers of
data are the key to
building products at scale
Data Warehouse, hosting, web platform, products, Development,
3rd party integrations, R&D
Custom Modules, Custom Themes, Contrib Modules, Contrib Themes,
Contrib Core, Custom Cores, Contrib Libraries, Glue Code, custom SQL,
Contrib Services, Custom Code, custom Scripts, Paid Services, Contrib Tools
Drupal, Wordpress, PHP,
Javascript, jQuery, HTML, CSS,
Flash, Bash, Perl, etc...
MySQL, PostgreSQL, Hadoop,
Cloudera, Hive, Hue, Impala,
Python, Java, SQL, etc...
Amazon, Ubuntu, Apache2,
Nginx, Node.js, Memcache,
Highwinds CDN, etc…
Appnexus, Right Media, Google,
Salesforce, Zendesk, Microsoft,
Chrome, Dropbox, etc...
Git, VSphere, VMware, Drush
Make, MAMP, Confluence,
Agile / Scrum, SOASTA, etc...
Mobile, Video, Bidders, IPs,
Viewability, Emerging Tech...
What does our stack look like?
We only build the red stuff!
Hadoop Distributed File System
HDFS: Everybody’s Doing It
– It’s just a file system!
– Feed it gzips, csvs,
whatever you’ve got
– Command line + library
interface to read/write files
to it
– Can be slow due to
replication across network
to data nodes
– Not much different than
sed/awk
Diagram: one Name Node and six Data Nodes.
Parallel DBs RDBMS
The Holy Grail of
Database Scalability
Claims of database parallelization in the
past have been greatly exaggerated.
Nevertheless, I believe we may be at the
dawn of a new era in this space.
Paths to parallelization
Sharding: Manual split of tables into
independent db instances. Joins across
dbs not possible without manual extract
and re-load into one instance.
Federation: Automatic split of tables
into independent db instances. Joins
across dbs managed by high-level
software layer that extracts data and
joins/merges outside the db instances.
Performance penalty in data
extraction/merge. Much redundant
work performed by each db instance in
parsing and compiling SQL.
True-MPP: Only one db instance, with
multiple compute/storage nodes. All
Joins across nodes are managed
natively by the execution engine within
the db instance. No redundant work
performed, no performance penalties.
Technique    Degree of Automation   Vendor       Price
Sharding     Manual                 Various…     Low
Federation   Semi-Manual            GreenPlum    High
True-MPP     Fully Auto             Netezza      High
True-MPP     Fully Auto             XtremeData   Low
* Thanks to Ravi the CTO of Xtreme Data for contributing this break-down!
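To make the sharding trade-off concrete, here is a minimal sketch of the manual route: rows are split across independent database instances, and any query that spans shards needs an extract and re-load into one instance first. In-memory SQLite stands in for the real databases. The `impressions` table, the modulo routing rule, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

NUM_SHARDS = 2

# Each shard is an independent database instance (in-memory SQLite as a stand-in).
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE impressions (campaign_id INTEGER, revenue REAL)")


def shard_for(campaign_id: int) -> sqlite3.Connection:
    """Manual routing rule: pick a shard from the campaign id."""
    return shards[campaign_id % NUM_SHARDS]


rows = [(1, 0.5), (2, 1.25), (3, 0.25), (1, 1.0)]
for campaign_id, revenue in rows:
    shard_for(campaign_id).execute(
        "INSERT INTO impressions VALUES (?, ?)", (campaign_id, revenue)
    )

# A query across shards: extract from every shard and re-load into one instance.
merged = sqlite3.connect(":memory:")
merged.execute("CREATE TABLE impressions (campaign_id INTEGER, revenue REAL)")
for db in shards:
    merged.executemany(
        "INSERT INTO impressions VALUES (?, ?)",
        db.execute("SELECT campaign_id, revenue FROM impressions"),
    )

totals = dict(
    merged.execute(
        "SELECT campaign_id, SUM(revenue) FROM impressions GROUP BY campaign_id"
    )
)
print(totals)
```

The extract-and-reload loop is the work that a federation layer automates, and that a true MPP engine avoids by handling cross-node joins natively.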
Traditional DBs RDBMS
MySQL Is Your Best Friend
• Feeding it from a
warehouse is the hardest
part; needs workflow
software and reduce jobs
• Works great for
read-heavy web applications
• Cheap talent, cheap
hosting, tons of support
• Creativity required for
heavy writes, e.g. node.js
+ queuing mechanism
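The heavy-write bullet above can be sketched as a queue in front of the database, so many small writes become a few batched inserts. The slide mentions node.js; Python is used here only to keep the examples in one language. SQLite stands in for MySQL, and the `events` table, `enqueue_event`, and `flush` are hypothetical names.

```python
import queue
import sqlite3

# Incoming writes land in a queue instead of hitting the database one by one.
write_queue = queue.Queue()


def enqueue_event(page: str, clicks: int) -> None:
    """Called by the web tier; cheap because it never touches the database."""
    write_queue.put((page, clicks))


def flush(db: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Drain up to batch_size queued events and insert them in one transaction."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(write_queue.get_nowait())
        except queue.Empty:
            break
    if batch:
        with db:  # commits once per batch
            db.executemany("INSERT INTO events VALUES (?, ?)", batch)
    return len(batch)


db = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
db.execute("CREATE TABLE events (page TEXT, clicks INTEGER)")

for i in range(2500):
    enqueue_event(f"/page/{i % 10}", 1)

batches = 0
while flush(db):
    batches += 1

print(batches, "batches written")
```

In production the queue would normally be an external broker so that queued writes survive a process restart.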
In-Memory Columnar DBs In-Memory
The New Memcached
• In-memory DBs or
“Columnar” databases are
just key-value pairs:
put('name', 'value')
• Some sophisticated layers
have been built on top to
turn them into near-SQL
• Crazy fast solution for
read-heavy systems like analytics
• Still needs workflow,
management, and a
traditional backend storage
system
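The put('name', 'value') idea and the need for a traditional backend can be sketched together as a read-through cache. This is an assumption-laden illustration, not any particular product's API: the `ReadThroughCache` class and the `stats` table are hypothetical, and SQLite stands in for the backend store.

```python
import sqlite3


class ReadThroughCache:
    """Key-value layer in memory, backed by a traditional database."""

    def __init__(self, backend: sqlite3.Connection) -> None:
        self.backend = backend
        self.memory = {}
        self.backend_reads = 0

    def put(self, name: str, value: str) -> None:
        self.memory[name] = value

    def get(self, name: str):
        # Serve from memory when possible; fall back to the backend on a miss.
        if name in self.memory:
            return self.memory[name]
        self.backend_reads += 1
        row = self.backend.execute(
            "SELECT value FROM stats WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            return None
        self.put(name, row[0])
        return row[0]


backend = sqlite3.connect(":memory:")  # stand-in for the backend storage system
backend.execute("CREATE TABLE stats (name TEXT PRIMARY KEY, value TEXT)")
backend.execute("INSERT INTO stats VALUES ('impressions_today', '5500000000')")

cache = ReadThroughCache(backend)
first = cache.get("impressions_today")   # miss: reads the backend
second = cache.get("impressions_today")  # hit: served from memory
print(first, second, cache.backend_reads)
```

Only the first read touches the backend, which is why this pattern suits read-heavy systems and still needs workflow to keep the cached values fresh.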
Big Data Distribution Requirements Hosting
Massive Cheap Infrastructure
• Crazy virtual server farms!
1000+ servers get created and
destroyed to perform 1 job
• Automation and deployment of
these servers is crucial;
infrastructure automation is the
new hot skill
• Small-to-medium systems or
growing products use cloud
first and only invest in metal
once stabilized, and even then
it’s rarely cost effective
• Connections between the
servers drive the performance
of your data warehouse
solution!
Life in the Cloud:
It’s so different. Forget
everything you thought you
knew. Except Unix.
Major Bottlenecks:
• Reading & writing to disk: disks are
usually network-connected to
cloud-based servers
• Communicating with other servers
during replication; need to shave off
milliseconds with optimizations
• Staying ahead of storage space
limitations with archive jobs
• Partitioning large datasets based on
primary reduce filters
• Keeping up with your dataset when you
start to get behind
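Partitioning on the primary reduce filter can be sketched like this, assuming the most common filter is the hour of the event. The `date_time` format, the sample records, and the `partition_key` helper are assumptions; the `key=value` directory naming follows the convention Hive uses for partitioned tables.

```python
from collections import defaultdict
from datetime import datetime


def partition_key(date_time: str) -> str:
    """Build a Hive-style partition path from a record's timestamp."""
    ts = datetime.strptime(date_time, "%Y-%m-%d %H:%M:%S")
    return f"dt={ts:%Y-%m-%d}/hour={ts:%H}"


# Hypothetical records; only date_time matters for partitioning.
records = [
    {"date_time": "2014-08-14 13:07:01", "campaign_id": 1},
    {"date_time": "2014-08-14 13:59:59", "campaign_id": 2},
    {"date_time": "2014-08-14 14:00:00", "campaign_id": 1},
]

partitions = defaultdict(list)
for record in records:
    partitions[partition_key(record["date_time"])].append(record)

for key in sorted(partitions):
    print(key, len(partitions[key]))
```

A job that filters on one hour then reads a single directory instead of scanning the whole dataset, and archive jobs can drop old partitions wholesale.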
Coordination Strategies Implementation
Amazon Hosted Cloud
• It’s like a treasure hunt.*
Vsphere/Openstack Private Cloud
• Roll your own someday.
Cloudera Hadoop Management
• You will love life.
Chef Deployment Automation
• OMG life gets better.
* Note: Rackspace is also awesome.
Workflow Strategies Implementation
You Still Need Data In & Out
• Hive/Pig – You definitely need them
– SQL sits on top of Hadoop so you
can query flat files like a table!
– Output into an RDBMS is easy, but
managing the jobs is hard
– Nobody wants to learn MapReduce
• Custom Coding
– Long term supportability is low
– High cost & slow to market
• Cloudera has Workflow Services!
– Impala
– Flume
No one is really the standard in this space yet, although there are a lot
of really interesting players. Check out the Big Data chart for more!
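To show what Hive saves you from writing, here is a toy, single-process sketch of the map, shuffle, and reduce steps behind a query such as SELECT geo_country, SUM(revenue) ... GROUP BY geo_country. It is not how Hadoop is invoked; the input lines and function names are made up for the example.

```python
from collections import defaultdict

# Flat-file input: one "geo_country,revenue" record per line (made-up data).
lines = ["US,1.50", "CA,0.25", "US,0.50", "GB,1.00"]


def map_phase(line: str):
    """Emit one (key, value) pair per input line."""
    country, revenue = line.split(",")
    yield country, float(revenue)


def shuffle(pairs):
    """Group every value under its key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values):
    """Collapse each key's values into a single result."""
    return key, sum(values)


mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(result)
```

With Hive the whole pipeline is one line of SQL, which is the point of the "nobody wants to learn MapReduce" bullet.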
The Big Data Adoption Problem Adoption
Problems:
• People don’t know what to do with the data
or how to gain insights
• The data changes too fast for traditional
software development; they don’t know what
they want until they see it, they can’t see it
until they tell you what they want!
• If it can’t feel the benefits of the
infrastructure, the business can’t continue to
invest in big data
Solutions:
• Open up windows into the workflow so
humans can dig around and discover things
in the data; teach everyone SQL
• Provide simple BI and visualization solutions
that don’t require custom development
• Support the classical Excel part of the
business world, and make your data
accessible in tabular exports
• Continue development on custom reporting
platforms, learning from the first three steps
along the way
Fancy Stuff Adoption
If you’ve come this far you can finally have…
• Statistical modeling
• Sentiment analysis
• Prediction algorithms
• Machine learning
• Mmmmmmm fun stuff…
BUT NOT UNTIL YOU CAN SUPPORT IT!!!!
The Most Important Things to Know Cheatsheet
• It’s still all about the reads vs. writes
• HDFS is just a file system format for documents
• Hadoop is just for crunching and outputting into normal
databases, you don’t actually point an application at it
• MPPs are awesome and the wave of the future
• In-memory columnar databases are all the rage
(because they’re crazy fast) and will probably be a
requirement for all high-scale apps in the future
• Don’t forget to become awesome at Unix system &
network administration, because all the same commands
work in the cloud and it’s the only way to understand
what’s going on under the hood!
What to do right now Try it Out
• Download Openstack and install it on your laptop OR Register for
Amazon AWS
• In your new Cloud:
– Download & install Cloudera Community
– Spin up a few servers & add them to Cloudera
– Find Open Source Xtreme Data MPP in the Marketplace
• Get more advanced:
– Set up a Chef implementation, try automating a few server
spin-ups & spin-downs
– Try the open source Druid in-memory DB
– Set up a node.js server w/ Express and pipe in some real-time
data
– Write a real-time data analytics front-end to see if it works!
• Where to get help?
– Forums are your best friend!
– IRC is your worst enemy but it’s still there for you!
– Wikipedia, YouTube, etc. all have great resources to learn.

Hofstra University - Overview of Big Data


Editor's Notes

  • #8 Would Michael Bublé stepping on the stage to an East Coast audience inspire 12 visitors to the website from China 3 seconds later???
  • #12 It’s the same as a decision to use a web framework instead of writing a new session handler over and over… who wants to write workflow & job management solutions from scratch?