Big data on a small budget
What do I know about big data?

- skobbler logs all positions from our users (100 billion+)
- More than 10 TB of data from users
- Products / revenues significantly improved with business intelligence

@apphil #2
Why should you learn about big data?

- Harvard Business Review: “Data Scientist: The Sexiest Job of the 21st Century”
- Obama became president of the US in large part due to the use of big data…
- World-class sports teams enhance their performance with big data
- Amazon, Google, Facebook, etc. have by now made all their dev processes data-driven

What are some great use cases for big data?
- Analyzing log files and user behavior (and predicting future behavior)
- A/B testing and automatic optimization of functionality
- Improving monetization (e.g. ad optimization, etc.)
- Checking adoption and usage of new features

When is it better not to rely on big data?
- When qualitative feedback is more useful than quantitative feedback (e.g. at very early-stage companies)
- When you don’t have enough users yet to get statistically relevant results
- When you do not know what you are optimizing for

What does a solid and simple workflow for big data analysis look like?

Workflow cycle (diagram): Log -> Process -> Analyse -> Improve -> Eval / Test -> back to Log

Tools / technologies for a good big data setup
- Logging: MongoDB, VoltDB, Cassandra
- Processing & analyzing / storing: Hadoop & HBase (batch), Storm (real-time), Samza (real-time)
- Optimizing: Mahout (machine learning)

How can you build this without breaking the bank?

- Analyse / process asynchronously
- Use cheap dedicated servers (vs. cloud)
- Use open / free software

Key cost factor: Real-time, near-time vs. batch

- Real-time is much more expensive than batch
- Leverage as much preprocessing as possible
- Try using in-memory technology for real-time analytics

#1 Log: Initially log as much data as feasible so it’s available later

- Define which data is interesting (if unsure, log too much rather than too little)
- Upload / collect data
- Decide on real-time, near-time or batch processing in the chain

#2 Process: Enhance the data and make it as rich as possible and easy to query

- Move data to the processing environment
- Run logged data through the processing chain so it can be queried
- Enhance the logged data with any additional data available (e.g. geography, social data, user data, etc.)
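A minimal sketch of the enhancement step, assuming logged events are plain dictionaries and that a lookup table from user ID to country is available. The field names (`user_id`, `country`) and the `USER_COUNTRY` table are hypothetical and only serve as illustration.

```python
# Hypothetical lookup table: user ID -> country (an assumption for this sketch).
USER_COUNTRY = {"u1": "DE", "u2": "RO"}


def enrich(event, user_country=USER_COUNTRY):
    """Return a copy of the logged event with a 'country' field added."""
    enriched = dict(event)  # copy, so the raw log record is not mutated
    enriched["country"] = user_country.get(event["user_id"], "unknown")
    return enriched


raw_events = [
    {"user_id": "u1", "action": "route_started"},
    {"user_id": "u3", "action": "route_started"},
]
enriched_events = [enrich(e) for e in raw_events]
```

Keeping the raw records untouched means the enhancement can be re-run later when more additional data becomes available.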

#3 Analyse: Cluster the data in meaningful groups and compare it

- Define key performance indicators (KPIs)
- Cluster data in a meaningful way (e.g. by geography, time of day, customer past behaviour)
- Compare data vs. reference sets
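The clustering and comparison can be sketched as follows. This is an illustration under assumptions, not a prescribed method: records are dictionaries, the KPI is a numeric field, and the reference set is a plain mapping from group to KPI value. The names `kpi_by_group` and `compare_to_reference` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def kpi_by_group(records, group_key, kpi_key):
    """Average a KPI per group (e.g. per country or per time of day)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[kpi_key])
    return {group: mean(values) for group, values in groups.items()}


def compare_to_reference(current, reference):
    """Per-group difference of the KPI vs. the reference set (positive = higher)."""
    return {g: current[g] - reference[g] for g in current if g in reference}


# Hypothetical sample data: one rating per record, clustered by country.
records = [
    {"country": "DE", "rating": 4.0},
    {"country": "DE", "rating": 5.0},
    {"country": "RO", "rating": 3.0},
]
reference = {"DE": 4.0, "RO": 3.5}
current = kpi_by_group(records, "country", "rating")
diff = compare_to_reference(current, reference)
```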
#4 Improve: Learn from the analysis where your challenges are, to optimize behavior

- Manually / automatically adjust features (e.g. lower prices in certain regions, etc.)
- Develop A/B testing scenarios and formulate improvement theories

#5 Evaluate
- Check if the KPIs improve after applying the changes
- Accept changes that improved your users’ behavior / reject changes that left it unchanged
- Define which additional logs you might need to better cluster / identify behaviour
- Go back to step #1
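One common way to decide whether a KPI really improved is a two-proportion z-test. The slides do not name a specific test, so this is a swapped-in technique shown as a sketch. It assumes the KPI is a success rate (e.g. share of drives that reached the destination); the function names and the threshold of 1.96 (the usual two-sided 95% level) are assumptions.

```python
import math


def two_proportion_z(success_a, total_a, success_b, total_b):
    """z statistic for the difference between two success rates (B minus A)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / std_err


def accept_change(success_a, total_a, success_b, total_b, z_threshold=1.96):
    """Accept only if variant B beats baseline A by more than the noise level."""
    return two_proportion_z(success_a, total_a, success_b, total_b) > z_threshold


# Baseline A: 400 of 1000 successes; changed version B: 460 of 1000.
large_sample = accept_change(400, 1000, 460, 1000)
# A larger relative difference, but far too few users to be meaningful.
small_sample = accept_change(4, 10, 5, 10)
```

With these hypothetical numbers the change is accepted for the large sample and rejected for the small one, which is the "not enough users yet" situation in practice.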

#1 Log: Practical example of how this works at skobbler
- Software version
- Routing profile used
- Device
- Raw positions
- Geography (e.g. country)
- Rating of the route (optional)
- Destination reached (yes / no)
- Etc.
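As an illustration only, the fields listed above could be captured in a record like the one below and written as one JSON line per drive. The class name `DriveLog`, the field names, the position tuple layout and the sample values are assumptions, not skobbler's actual log format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DriveLog:
    software_version: str
    routing_profile: str
    device: str
    country: str
    destination_reached: bool
    # Raw positions as (timestamp_seconds, latitude, longitude) tuples.
    raw_positions: List[Tuple[float, float, float]] = field(default_factory=list)
    route_rating: Optional[int] = None  # optional: only set if the user rated the route


def to_log_line(entry: DriveLog) -> str:
    """Serialize one log entry as a single JSON line for later processing."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = DriveLog("1.0", "car_fastest", "phone", "DE", True, [(0.0, 52.5, 13.4)])
line = to_log_line(entry)
```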

#2 Process: Enhance and split the data based on drives and segments
- Combine the data on a per-drive basis (= session)
- Combine the data on a per-segment basis (= how fast people are driving on a street versus our estimate)
- Identify key behavior across the route (e.g. reroutings, etc.)
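A rough sketch of the per-drive and per-segment grouping. It assumes the raw positions have already been matched to street segments, so each sample is one traversal of a segment with its length and travel time. All field names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Each sample: one traversal of a street segment during a drive (hypothetical fields).
samples = [
    {"drive_id": "d1", "segment_id": "s1", "length_m": 1000.0, "seconds": 50.0},
    {"drive_id": "d1", "segment_id": "s2", "length_m": 500.0, "seconds": 50.0},
    {"drive_id": "d2", "segment_id": "s1", "length_m": 1000.0, "seconds": 40.0},
]
# Speed the routing currently assumes per segment, in m/s (hypothetical values).
estimated_speed = {"s1": 20.0, "s2": 15.0}


def group_by(items, key):
    """Group dictionaries by the value stored under `key`."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return dict(groups)


def observed_vs_estimated(samples, estimated_speed):
    """Per segment: (mean observed speed, estimated speed), both in m/s."""
    result = {}
    for segment_id, rows in group_by(samples, "segment_id").items():
        observed = mean(r["length_m"] / r["seconds"] for r in rows)
        result[segment_id] = (observed, estimated_speed[segment_id])
    return result


drives = group_by(samples, "drive_id")  # per drive (= session)
segments = observed_vs_estimated(samples, estimated_speed)  # per segment
```

In this toy data people drive faster than estimated on `s1` and slower on `s2`.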

Example: Real-time analysis with the Twitter Storm framework to detect road changes

Figure: Example visualization of drives in the last five minutes (real-time)

Example: Historic driving patterns (processed with Hadoop / HBase)

#3 Analyse: Try to see in which areas our routing is not optimal
- KPIs are:
  - Route rating (if given)
  - Number of re-routings (the fewer the better)
  - Time to destination vs. the estimate by routing
- Cluster the data by:
  - Routing algorithm (and parameters used)
  - Geography
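The three KPIs can be computed per cluster along these lines. This is a sketch under assumptions: each drive is a dictionary, and the field names (`algorithm`, `country`, `rating`, `reroutes`, `actual_s`, `estimated_s`) and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

drives = [
    {"algorithm": "A", "country": "DE", "rating": 4, "reroutes": 1,
     "actual_s": 660, "estimated_s": 600},
    {"algorithm": "A", "country": "DE", "rating": None, "reroutes": 3,
     "actual_s": 600, "estimated_s": 600},
    {"algorithm": "B", "country": "DE", "rating": 5, "reroutes": 0,
     "actual_s": 600, "estimated_s": 600},
]


def routing_kpis(drives):
    """KPIs per (routing algorithm, country) cluster."""
    clusters = defaultdict(list)
    for d in drives:
        clusters[(d["algorithm"], d["country"])].append(d)
    report = {}
    for key, rows in clusters.items():
        # Ratings are optional, so unrated drives are skipped for this KPI.
        ratings = [r["rating"] for r in rows if r["rating"] is not None]
        report[key] = {
            "mean_rating": mean(ratings) if ratings else None,
            "mean_reroutes": mean(r["reroutes"] for r in rows),
            # Above 1.0 means drives took longer than the routing estimated.
            "time_ratio": mean(r["actual_s"] / r["estimated_s"] for r in rows),
        }
    return report


report = routing_kpis(drives)
```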

#4 Improve: Come up with strategies to improve the routing experience based on data
- For future routes, improve the estimate of the time taken on a segment based on the time actually travelled
- Alter routing parameters based on country specifics to get better results (e.g. in Germany people drive faster on the Autobahn)
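One simple way to feed measured travel times back into the estimate is an exponential moving average, combined with a per-country speed factor. The slides do not say which update rule is used, so both the rule and every name and number here (`update_estimate`, `COUNTRY_SPEED_FACTOR`, the factor 1.1 for Germany) are assumptions for illustration.

```python
def update_estimate(current_estimate_s, observed_s, weight=0.2):
    """Move the segment's time estimate a fraction `weight` toward an observation."""
    return (1 - weight) * current_estimate_s + weight * observed_s


# Hypothetical per-country speed factors applied on top of the base estimate.
COUNTRY_SPEED_FACTOR = {"DE": 1.1}


def estimated_time(base_time_s, country):
    """Scale the base travel time by a country-specific speed factor (default 1.0)."""
    return base_time_s / COUNTRY_SPEED_FACTOR.get(country, 1.0)


# A segment estimated at 60 s was actually driven in 50 s.
new_estimate = update_estimate(60.0, 50.0, weight=0.5)
```

A small `weight` makes the estimate change slowly and resist outliers; a large one reacts faster to real changes on the road.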

#5 Evaluate: Deploy the changes and compare them to reference data

- Deploy changes to production and compare ratings / timings vs. base values (~weekly)
- Verify whether other parameters such as usage, etc. also improve

Summary: Big data can drive big value but stay affordable

Simple formula:
Log -> Process -> Analyze -> Improve -> Evaluate = Success

Thank you for your attention!
Get in Touch: philipp.kandal@skobbler.com
Phone: +49-172-4597015
Follow me on
.com/apphil

Philipp Kandal, CTO, Skobbler - Big data on a small budget
