Hofstra University - Overview of Big Data

Big Data: Technologies & Challenges Facing Business Today
A Practical Guide to Getting Started

My Name is Sara Robertson Hello
About Me
• I’m the VP of Technology at
CPX Interactive.
• Previously ran the platform
team at Warner Music Group.
• I choose technology based on
cost, efficiency, availability of
talent, valuation potential, long
term outlook, community
support, and other reasons
besides just pure tech.
• I love agile, open source, bio-
hacking, anime and martial
arts movies, emo, audiobooks.
• Favorite tech skills:
optimization, debugging.
About Me + Data
• Started at a mainframe
company doing Nagios-style
server & network monitoring.
• Spent my early years
obsessed with Oracle, then
Postgres, then MySQL.
• Most of my data experience is
in either high-traffic web
applications or back-office data
warehousing.
• I believe that data is about
humans as much as it is about
technology, and if your solution
doesn’t speak to your users
then it’s not really a solution.

8/14/2014 3
Big Data Landscape

Big Data We’ll Talk About Today

Four Problems With Data Problems
• Too Big
• Too Fast
• Too Disparate
• Too Unwieldy
x_cslookup (columns):
system
creative_id
cpx_creative_name
viq_creative_name
alt_creative_id
alt_creative_name
size
file_size
click_track
audit_status
type
li_number
li_name
camp_number
camp_name
viq_placement_id

hourlycpxDataSiphonHour.py

x_datasiphon (column, type):
auction_id_64              bigint
date_time                  string
user_tz_offset             tinyint
width                      smallint
height                     smallint
media_type                 tinyint
fold_position              tinyint
event_type                 string
imp_type                   tinyint
payment_type               tinyint
media_cost_dollars_cpm     float
revenue_type               tinyint
buyer_spend                float
buyer_bid                  float
ecp                        float
eap                        float
is_imp                     int
is_learn                   tinyint
predict_type_rev           tinyint
othuser_id_64              bigint
ip_address                 string
ip_address_trunc           string
geo_country                string
geo_region                 string
operating_system           tinyint
browser                    tinyint
language                   tinyint
venue_id                   int
seller_member_id           int
publisher_id               int
site_id                    int
site_domain                string
tag_id                     int
external_inv_id            int
reserve_price              float
seller_revenue_cpm         float
media_buy_rev_share_pct    float
pub_rule_id                int
seller_currency            string
publisher_currency         string
publisher_exchange_rate    float
serving_fees_cpm           float
serving_fees_revshare      float
buyer_member_id            int
advertiser_id              int
brand_id                   int
advertiser_frequency       int
advertiser_recency         int
insertion_order_id         int
campaign_group_id          int
campaign_id                int
creative_id                int
creative_freq              int
creative_rec               int
cadence_modifier           float
can_convert                tinyint
user_group_id              int
is_control                 tinyint
control_pct                tinyint
control_creative_id        int
is_click                   int
pixel_id                   int
is_remarketing             tinyint
post_click_conv            int
post_view_conv             int
post_click_revenue         float
post_view_revenue          float
order_id                   string
external_data              string
pricing_type               string
booked_revenue_dollars     float
booked_revenue_adv_curr    float
commission_cpm             float
commission_revshare        float
auction_service_deduction  float
auction_service_fees       float
creative_overage_fees      float
clear_fees                 float
buyer_currency             string
advertiser_currency        string
advertiser_exchange_rate   float
Google’s qpms in 2000
Google’s qpms in 2011
Old Table Layouts
New Table Layouts
Old File Formats
New File Formats
Old Pattern of Change in Data & Business Processes (Waterfall!)
New Pattern of Change, Agile!

Data Problems in Advertising Examples
Our statistics at CPX…
• 5.5+ billion impressions per day
• 20+ billion bids per day
• 45+ billion segments per day
• 80+ columns per data stream
• Columns average 25+ bytes
That’s more than 100
Terabytes and 75 Billion
records every day! *
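
As a rough check on that headline number, the statistics above can simply be multiplied out. This is a back-of-the-envelope sketch, not the author's calculation: it uses the lower bounds from the list (every figure carries a "+"), assumes each record has exactly 80 columns of 25 bytes, and counts in decimal terabytes. The variable names are made up for illustration.

```python
# Back-of-the-envelope check using the lower bounds quoted on the slide.
impressions_per_day = 5.5e9
bids_per_day = 20e9
segments_per_day = 45e9
columns_per_record = 80   # "80+ columns per data stream"
bytes_per_column = 25     # "Columns average 25+ bytes"

records_per_day = impressions_per_day + bids_per_day + segments_per_day
bytes_per_day = records_per_day * columns_per_record * bytes_per_column
terabytes_per_day = bytes_per_day / 1e12  # decimal terabytes (assumption)

print(f"{records_per_day / 1e9:.1f} billion records per day")
print(f"{terabytes_per_day:.0f} TB per day")
```

The lower bounds alone give about 70.5 billion records and roughly 141 TB per day, which is consistent with "more than 100 Terabytes"; the 75 billion figure presumably reflects the "+" on each statistic.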
Life of a Single Ad Impression:
Website
Ad Server
• Access Logs
• Cookies
• Interactions
…
• Predictions
• Demographics
• Prices…
• Potential Buyers
• Market Trends
• Preferences…
Exchanges
Bidders
• Bid Parameters
• Wins/Losses
• Ceilings/Floors
…
• Creatives
• Targeting
• Analytics…
• Bid Attempts
• Imp Value
• Market Demand
• Revenue
• Costs
• Performance…
• Profit
• Winners
• Demographics

Data Problems in Music Examples
Our statistics at Warner…
• 5+ million radio plays daily
• 10+ million digital tracks sold
daily
• $1+ million in ecommerce daily
• 20+ million fans online
• 10-20 channels of interaction
with every fan
• Thousands of feeds of data that
could potentially mention a band
An essentially unlimited
supply of new data
streams with ever-
changing data formats!
1:07:01pm Radio Plays in Seattle:
MBUBLE HAVENT MET YOU YE
BUBLE´, MICHAEL
HAVEN’TMETYOUYET
1:07:02pm On Twitter in Cyberspace:
OMG this song is so sick! <3 #mbuble #haventmet
This met you yet Bubble´ song makes me sick.
1:07:03pm On Website from China:
12 visitors to the website.
1:07:00pm On TV in New York:
Michael Buble´ appears on Oprah.
• First, match up the many different
versions of the artist’s name
• Then analyze sentiment to tell the
difference between uses of “sick”
• Then augment sparse data streams
with useful dimensions (time,
location)
• Then decide how to correlate data!
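
To make the first step concrete, here is a minimal sketch of matching the different spellings of an artist's name. It only covers name matching, not sentiment or correlation. The alias table, `tokenize`, and `match_artist` are hypothetical names invented for this example, and a real system would need a much larger alias list plus fuzzy matching.

```python
import re
import unicodedata

# Hypothetical alias table: canonical artist name -> spellings seen in feeds.
ARTIST_ALIASES = {
    "Michael Buble": {"buble", "mbuble", "bubble"},
}

def tokenize(text):
    """Lowercase, strip accents and punctuation, and split into word tokens."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.findall(r"[a-z0-9]+", ascii_only.lower())

def match_artist(text):
    """Return the canonical artist whose alias appears in the text, else None."""
    words = set(tokenize(text))
    for artist, aliases in ARTIST_ALIASES.items():
        if words & aliases:
            return artist
    return None

feeds = [
    "MBUBLE HAVENT MET YOU YE",
    "BUBLE´, MICHAEL",
    "OMG this song is so sick! <3 #mbuble #haventmet",
    "12 visitors to the website.",
]
for line in feeds:
    print(match_artist(line), "<-", line)
```

Note that an alias like "bubble" will also match text about actual bubbles, which is exactly why the later steps (sentiment, extra dimensions, correlation) are still needed.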

Database Types
Spectrum: Unstructured to Structured
• Document Store / Distributed File System: used for unstructured fast-flowing data
• Massively Parallel (MPPs): used for structured high-volume, high read + write data
• Master/Slave: used for heavy read apps with normal data volume & scale requirements
• In-Memory Columnar: used for super high speed read-only access to cacheable data
• Key-Value Pair: used for semi-structured fast-flowing data
Document stores export into Parallel and Master databases, which cache into Columnar databases.

What we tried in Music Case Study
Roll-Your-Own approach
• Python + RabbitMQ +
MongoDB + PHP for
custom BI layer
• Custom development of
workflow, transformation,
storage, correlation,
smoothing, analysis
• Custom dev of dashboards,
reports, charts, etc for the
business
Why it didn’t work
• Bleeding edge technologies
were too immature and cost
of talent was too high
• Outsourced dev + insourced
support = fail
• Too much overhead to get a
usable product

What we’re doing in Advertising Case Study
Use-A-Stack approach
• Leverage a kick-start with
a stack that reduces
implementation time,
learning curve, and talent
costs
• Write pluggable modules
• Build the plan for multi-
layered data storage from
the beginning
Why it’s working
• By keeping our
investment and footprint
light, we’re able to
respond quickly to
changes in the industry &
technology ecosystem
• The multiple layers of
data are the key to
building products at scale

Data Warehouse
hosting
web platform
products
Development
3rd party integrations
R&D
Custom Modules
Custom Themes
Contrib Modules
Contrib Themes
Contrib Core
Custom Cores
Contrib Libraries
Glue Code
custom SQL
Contrib Services
Custom Code
custom Scripts
Paid Services
Contrib Tools
Drupal, Wordpress, PHP, Javascript, jQuery, HTML, CSS, Flash, Bash, Perl, etc...
MySQL, PostgreSQL, Hadoop, Cloudera, Hive, Hue, Impala, Python, Java, SQL, etc...
Amazon, Ubuntu, Apache2, Nginx, Node.js, Memcache, Highwinds CDN, etc…
Appnexus, Right Media, Google, Salesforce, Zendesk, Microsoft, Chrome, Dropbox, etc...
Git, VSphere, VMware, Drush Make, MAMP, Confluence, Agile / Scrum, SOASTA, etc...
Mobile, Video, Bidders, IPs, Viewability, Emerging Tech...
What does our stack look like?
We only build the red stuff!

Hadoop Distributed File System
HDFS: Everybody’s Doing It
– It’s just a file system!
– Feed it gzips, csvs,
whatever you’ve got
– Command line + library
interface to read/write files
to it
– Can be slow due to
replication across network
to data nodes
– Not much different than
sed/awk
Diagram: one Name Node and six Data Nodes
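
To see why replication costs network time, here is a toy sketch of how a file might be cut into blocks and copied to several data nodes, with the name node keeping the lookup table. Everything here is an assumption for illustration: `place_blocks` is a hypothetical helper, the sizes are arbitrary units, and the round-robin placement is a simplification rather than HDFS's real placement policy. A replication factor of 3 is used because that is the usual HDFS default.

```python
from itertools import cycle

def place_blocks(file_size, block_size, data_nodes, replication=3):
    """Toy name-node table: block index -> data nodes holding a replica."""
    if replication > len(data_nodes):
        raise ValueError("replication exceeds the number of data nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    ring = cycle(data_nodes)
    # Consecutive picks from the ring are distinct nodes for each block.
    return {block: [next(ring) for _ in range(replication)]
            for block in range(num_blocks)}

nodes = [f"datanode{i}" for i in range(1, 7)]  # six data nodes, as on the slide
placement = place_blocks(file_size=300, block_size=128, data_nodes=nodes)
for block, replicas in placement.items():
    print(block, replicas)

# Each byte is stored once per replica, so writes cost about 3x the file size.
bytes_written = 300 * 3
```

The point of the sketch is the multiplier at the end: every write is repeated once per replica, and those copies travel over the network between data nodes.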

Parallel DBs RDBMS
The Holy Grail of
Database Scalability
Claims of database parallelization in the
past have been greatly exaggerated.
Nevertheless, I believe we may be at the
dawn of a new era in this space.
Paths to parallelization
Sharding: Manual split of tables into
independent db instances. Joins across
dbs not possible without manual extract
and re-load into one instance.
Federation: Automatic split of tables
into independent db instances. Joins
across dbs managed by high-level
software layer that extracts data and
joins/merges outside the db instances.
Performance penalty in data
extraction/merge. Much redundant
work performed by each db instance in
parsing and compiling SQL.
True-MPP: Only one db instance, with
multiple compute/storage nodes. All
joins across nodes are managed
natively by the execution engine within
the db instance. No redundant work
performed, no performance penalties.
Technique    Degree of Automation (Manual / Semi-Manual / Fully Auto)    Vendor        Price
Sharding     X                                                           Various…      Low
Federation   X                                                           GreenPlum     High
True-MPP     X                                                           Netezza       High
True-MPP     X                                                           XtremeData    Low
* Thanks to Ravi, the CTO of XtremeData, for contributing this breakdown!
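
The difference between sharding and federation is easier to see in code. The sketch below is only an illustration under stated assumptions: in-memory `sqlite3` databases stand in for independent db instances, and the `impressions` table, `make_shard`, and `federated_revenue_by_campaign` are hypothetical names. It splits rows by hand (sharding) and then answers a query by running the same SQL on every shard and merging the results outside the databases (the federation pattern).

```python
import sqlite3

def make_shard(rows):
    """One independent database instance holding part of the table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE impressions (campaign_id INTEGER, revenue REAL)")
    db.executemany("INSERT INTO impressions VALUES (?, ?)", rows)
    return db

# Sharding: rows are split manually, here by campaign_id modulo shard count.
all_rows = [(1, 0.5), (2, 1.0), (3, 0.25), (1, 1.5), (2, 2.0)]
num_shards = 2
shards = [make_shard([r for r in all_rows if r[0] % num_shards == i])
          for i in range(num_shards)]

def federated_revenue_by_campaign(shards):
    """Federation-style query: same SQL on every shard, merged outside the dbs."""
    sql = "SELECT campaign_id, SUM(revenue) FROM impressions GROUP BY campaign_id"
    totals = {}
    for db in shards:
        # Every instance parses and plans the same statement again: this is
        # the redundant work described above.
        for campaign_id, revenue in db.execute(sql):
            totals[campaign_id] = totals.get(campaign_id, 0.0) + revenue
    return totals

print(federated_revenue_by_campaign(shards))
```

A True-MPP engine would do the equivalent of the merge loop inside its own execution engine, so the application would send one query to one instance.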

Traditional DBs RDBMS
MySQL Is Your Best Friend
• Feeding it from a
warehouse is the hardest
part; needs workflow
software and reduce jobs
• Works great for read-
heavy web applications
• Cheap talent, cheap
hosting, tons of support
• Creativity required for
heavy writes, e.g. node.js
+ queuing mechanism
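
One way to picture the "queuing mechanism" for heavy writes: the web tier drops events on a queue and a separate step writes them to the database in batches. This is a sketch under assumptions, not the setup described in the talk: Python's `queue` module stands in for node.js plus a message queue, in-memory `sqlite3` stands in for MySQL, and the `clicks` table and `drain` function are hypothetical.

```python
import queue
import sqlite3

def drain(events, db, batch_size=3):
    """Move queued events into the database in batches; return the batch count."""
    batches = 0
    while not events.empty():
        batch = []
        while len(batch) < batch_size and not events.empty():
            batch.append(events.get_nowait())
        # One multi-row insert and one commit per batch instead of per event.
        db.executemany("INSERT INTO clicks (creative_id, ts) VALUES (?, ?)", batch)
        db.commit()
        batches += 1
    return batches

db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.execute("CREATE TABLE clicks (creative_id INTEGER, ts TEXT)")

events = queue.Queue()
for i in range(7):  # the web tier only enqueues; it never waits on the database
    events.put((i % 2, f"13:07:0{i}"))

batches = drain(events, db)
row_count = db.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
print(batches, row_count)
```

Batching trades a little latency for far fewer round trips and commits, which is usually where a write-heavy MySQL setup hurts first.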

In-Memory Columnar DBs In-Memory
The New Memcached
• In-memory DBs or
“Columnar” databases are
just key-value pairs:
put('name', 'value')
• Some sophisticated layers
have been built on top to
turn it into near-SQL
• Crazy fast solution for read-
heavy systems like analytics
• Still needs workflow,
management, and a
traditional backend storage
system
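
The last bullet is worth a sketch: the fast in-memory layer answers reads, but a traditional store stays behind it as the system of record. The class below is a minimal illustration, assuming a plain dict as the backend; `ReadThroughCache` and its counter are hypothetical names, and only the `put('name', 'value')` call shape comes from the slide.

```python
class ReadThroughCache:
    """Tiny in-memory key-value layer in front of a slower backend store."""

    def __init__(self, backend):
        self._backend = backend      # stand-in for the traditional storage system
        self._memory = {}
        self.backend_reads = 0

    def put(self, name, value):
        self._backend[name] = value  # the backend stays the system of record
        self._memory[name] = value

    def get(self, name):
        if name not in self._memory:
            self.backend_reads += 1  # cache miss: pay for the slow read once
            self._memory[name] = self._backend[name]
        return self._memory[name]

backend = {"campaign:42:impressions": 1200}
cache = ReadThroughCache(backend)
cache.put("name", "value")
first = cache.get("campaign:42:impressions")
second = cache.get("campaign:42:impressions")
print(first, second, cache.backend_reads)
```

The second read never touches the backend, which is the whole appeal for read-heavy analytics; keeping the two layers in sync is the workflow and management problem the slide warns about.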

Big Data Distribution Requirements Hosting
Massive Cheap Infrastructure
• Crazy virtual server farms!
1000+ servers get created and
destroyed to perform 1 job
• Automation and deployment of
these servers is crucial;
infrastructure automation is the
new hot skill
• Small-to-medium systems or
growing products use cloud
first and only invest in metal
once stabilized, and even then
it’s rarely cost effective
• Connections between the
servers drive the performance
of your data warehouse
solution!
Life in the Cloud:
It’s so different. Forget
everything you thought you
knew. Except Unix.
Major Bottlenecks:
• Reading & writing to disk: disks are
usually network-connected to cloud-
based servers
• Communicating with other servers
during replication; need to shave off
milliseconds with optimizations
• Staying ahead of storage space
limitations with archive jobs
• Partitioning large datasets based on
primary reduce filters
• Keeping up with your dataset when you
start to get behind
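
For the partitioning bullet, a common approach is to lay files out by whichever filter your jobs apply first, very often the date. The sketch below assumes that filter is the day portion of `date_time` and that timestamps look like 'YYYY-MM-DD HH:MM:SS'; the root path and both function names are hypothetical. The `dt=...` directory naming follows the key=value convention Hive uses for partitions.

```python
from collections import defaultdict

def partition_path(record, root="/warehouse/impressions"):
    """Hive-style directory for a record, keyed by the date part of date_time."""
    day = record["date_time"][:10]  # assumes 'YYYY-MM-DD HH:MM:SS' strings
    return f"{root}/dt={day}"

def partition(records):
    """Group records by partition path so a job can read only the days it needs."""
    buckets = defaultdict(list)
    for record in records:
        buckets[partition_path(record)].append(record)
    return dict(buckets)

records = [
    {"auction_id_64": 1, "date_time": "2014-08-13 23:59:58"},
    {"auction_id_64": 2, "date_time": "2014-08-14 00:00:01"},
    {"auction_id_64": 3, "date_time": "2014-08-14 13:07:02"},
]
for path, rows in partition(records).items():
    print(path, len(rows))
```

With this layout, a job that only needs one day reads one directory instead of scanning the whole dataset, and archive jobs can move old days out a directory at a time.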

Coordination Strategies Implementation
Amazon Hosted Cloud
• It’s like a treasure hunt.*
Vsphere/Openstack Private Cloud
• Roll your own someday.
Cloudera Hadoop Management
• You will love life.
Chef Deployment Automation
• OMG life gets better.
* Note: Rackspace is also awesome.

Workflow Strategies Implementation
You Still Need Data In & Out
• Hive/Pig – You definitely need them
– SQL sits on top of Hadoop so you
can query flat files like a table!
– Output into an RDBMS is easy, but
managing the jobs is hard
– Nobody wants to learn MapReduce
• Custom Coding
– Long term supportability is low
– High cost & slow to market
• Cloudera has Workflow Services!
– Impala
– Flume
No one is really the standard in this space yet, although there are a lot
of really interesting players. Check out the Big Data chart for more!
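
The idea of querying flat files like a table can be tried on a laptop before touching a cluster. This is a small stand-in, not Hive: Python's `csv` and `sqlite3` modules load an inlined flat file into an in-memory table and run SQL over it, whereas Hive runs the query over the files where they sit in Hadoop. The `events` table and the sample rows are made up; the column names are borrowed from the data stream shown earlier.

```python
import csv
import io
import sqlite3

# A small flat file, inlined so the sketch is self-contained.
flat_file = io.StringIO(
    "geo_country,is_click\n"
    "US,1\n"
    "US,0\n"
    "CN,1\n"
)

rows = [(r["geo_country"], int(r["is_click"])) for r in csv.DictReader(flat_file)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (geo_country TEXT, is_click INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Anyone who knows SQL can now ask questions of the flat file.
clicks_by_country = dict(db.execute(
    "SELECT geo_country, SUM(is_click) FROM events GROUP BY geo_country"
))
print(clicks_by_country)
```

The query is the part that carries over: the same GROUP BY works whether the rows came from a three-line CSV or from a day of log files.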

The Big Data Adoption Problem Adoption
Problems:
• People don’t know what to do with the data
or how to gain insights
• The data changes too fast for traditional
software development; they don’t know what
they want until they see it, they can’t see it
until they tell you what they want!
• If the business can’t feel the benefits of the
infrastructure, it can’t continue to
invest in big data
Solutions:
• Open up windows into the workflow so
humans can dig around and discover things
in the data, teach everyone SQL
• Provide simple BI and visualization solutions
that don’t require custom development
• Support the classical Excel part of the
business world, and make your data
accessible in tabular exports
• Continue development on custom reporting
platforms, learning from the first three steps
along the way

Fancy Stuff Adoption
If you’ve come this far you can finally have…
• Statistical modeling
• Sentiment analysis
• Prediction algorithms
• Machine learning
• Mmmmmmm fun stuff…
BUT NOT UNTIL YOU CAN SUPPORT IT!!!! 

The Most Important Things to Know Cheatsheet
• It’s still all about the reads vs. writes
• HDFS is just a file system format for documents
• Hadoop is just for crunching and outputting into normal
databases, you don’t actually point an application at it
• MPPs are awesome and the wave of the future
• In-memory columnar databases are all the rage
(because they’re crazy fast) and will probably be a
requirement for all high-scale apps in the future
• Don’t forget to become awesome at Unix system &
network administration, because all the same commands
work in the cloud and it’s the only way to understand
what’s going on underneath the hood!

What to do right now Try it Out
• Download Openstack and install it on your laptop OR Register for
Amazon AWS
• In your new Cloud:
– Download & install Cloudera Community
– Spin up a few servers & add them to Cloudera
– Find Open Source Xtreme Data MPP in the Marketplace
• Get more advanced:
– Set up a Chef implementation, try automating a few server spin-ups
& spin-downs
– Try the open source Druid in-memory DB
– Set up a node.js server w/ Express and pipe in some real-time
data
– Write a real-time data analytics front-end to see if it works!
• Where to get help?
– Forums are your best friend!
– IRC is your worst enemy but it’s still there for you!
– Wikipedia, YouTube, etc. all have great resources to learn.
