© 2014 Datameer, Inc. All rights reserved.
How to Avoid Pitfalls in Big Data Analytics
View Recording

You can view the recording of this webinar at:
http://info.datameer.com/Online-Slideshare-How-to-Avoid-Pitfalls-in-Big-Data-Analytics-OnDemand.html
About Our Speaker

Matt Schumpert @datameer
Senior Director, Solutions Engineering

Matt has been working in the enterprise infrastructure software space for over 14 years in various capacities, including sales engineering, strategic alliances, and consulting.

Matt currently runs the pre-sales engineering team at Datameer, supporting all technical aspects of customer engagement, from initial contact through roll-out of customers into production.

Matt holds a BS in Computer Science from the University of Virginia.

#datameer @datameer
About Our Speaker

Dale Kim @MapR
Director, Product Marketing

Dale Kim is the Director of Product Marketing at MapR. His background includes a variety of technical and management roles at information technology companies. While his experience includes work with relational databases, much of his career pertains to non-relational data in the areas of search, content management, and NoSQL.

Dale holds an MBA from Santa Clara University and a BA in Computer Science from the University of California, Berkeley.

#mapr @mapr
Agenda
▪ Quick introduction to Hadoop
▪ Overview of analytics on Hadoop
▪ Quick tips on big data analytics
▪ Our 5 big data pitfalls to avoid
Quick Introduction to Apache Hadoop
▪ What is Apache Hadoop
– Software framework for reliable, scalable,
distributed computing
– “Divide-and-conquer” approach to processing large data sets (see the sketch below)
▪ Hadoop does analytics
– Hadoop is the platform of choice for big data
– If you have big data, then you are analyzing
big data
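
To make the “divide-and-conquer” idea concrete, here is a minimal word-count sketch for Hadoop Streaming in Python. It is an illustration only, not material from the webinar; the script name, input/output paths, and the streaming jar location in the comment are hypothetical.

#!/usr/bin/env python
# Minimal Hadoop Streaming word count (illustrative sketch).
# The mapper emits (word, 1) pairs; Hadoop sorts them by key, and the reducer
# sums the counts per word, which is the "divide-and-conquer" step in practice.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    # Hypothetical invocation (paths and jar location are placeholders):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /data/logs -output /data/wordcount \
    #     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
    #     -file wordcount.py
    mapper() if sys.argv[1:] == ["map"] else reducer()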
Types of Analytics for Hadoop
▪ Descriptive – what happened, and why
– The “why” is also known as “diagnostic”
– Data mining, management reporting
Types of Analytics for Hadoop [2]
▪ Predictive – what will happen
– Cross-sell/up-sell (recommendations), fraud/anomaly detection
▪ Prescriptive – what should I do
– Preventative maintenance, smart meter analysis
▪ Better with more data
Common Data Types for Hadoop
▪ Clickstream/user behavior history
▪ Sensor/machine/event logs
▪ Social media profiles & communication
▪ Data warehouse data (structured, SoR)
▪ Long-tail/archive data
The Foundation for an Analytics Platform
▪ Performance
– Make sure you get results in a timely manner
▪ Scalability
– Let your platform grow as your data grows
▪ Reliability
– Keep your users productive
▪ Ease-of-use
– Give users an end-to-end, self-service
platform that delivers fast time-to-insight
Quick Tips on Big Data Analytics
▪ Minimize copying large data volumes across the wire
▪ Plan for production issues (system responsiveness, performance, high availability, disaster recovery, audits)
▪ Start by looking for ways Hadoop can supplement, not supplant, your existing system
▪ Be wary of reusing a classic application virtualization stack
▪ Choose “built-on”, not “connects-to”, Hadoop vendors
▪ Be wary of lofty claims around machine learning (e.g., IBM Watson)
▪ As Hadoop is an emerging technology, pick innovative rather than legacy vendors
Common Pitfalls in Big Data Implementations
1. Incomplete plan for scaling up
2. Not architecting for maximum uptime
3. Over-use of immature technologies
4. Excessive/insufficient data governance
5. Wasting data scientists’ time with data
preparation
Incomplete Plan for Scaling Up
RDBMS vs. Hadoop:
•  RDBMS: monolithic, RDBMS-based system; vertical scaling; large upgrade expenditure
•  Hadoop: commodity server-based system; horizontal scaling; incremental expenditure
Incomplete Plan for Scaling Up [2]
▪ Relatively easy to extrapolate existing data load into the future
▪ But, must also factor in:
–  Larger time windows of data
•  Expanding beyond a 3-month time window broke the system
•  Can now store 18 months, resulting in more accurate analytics
–  More data sources
•  Typically, new sources that could not be added before
–  More use cases and users
•  More divisions want to join the system
Not Architecting for Maximum Uptime
Separate clusters keep user communities and data isolated, but at the cost of greater infrastructure complexity and risk
Not Architecting for Maximum Uptime [2]
▪ Separate physical clusters for separate
“tenants” appears easy
▪ Multiple clusters lead to:
– Infrastructural complexity, more risk of error
– More points of failure
▪ Instead, leverage software components to
help logically separate users/data
Not Architecting for Maximum Uptime [3]
▪ Global Storage Solutions Company
▪ Deployed file-serving HBase application
▪ Introduced ad-hoc analytics in the same cluster
▪ No resource fencing, poor workload mgmt.
▪ Result: Significant downtime
Over-Use of Hadoop Ecosystem Technologies
▪ Research group at a Fortune 500
▪ Anxious to deliver the first NoSQL project
▪ Built an overly complex data model
▪ Deployed HBase with no support/expertise
▪ Lack of integration/analytics = limited success
Excessive / Insufficient Data Governance
▪ Under-Governed
–  Users deleting “unused data” after a project
–  Incorrectly interpreted as data loss by others
–  Result: panic
▪ Over-Governed
–  Fortune 500 deployed Hadoop as a shared IT service
–  Needed chargebacks based on data volume
–  Set up a “walled garden” for each project
–  Result: no sharing, no collaboration, fewer insights
Wasting Data Scientists’ Time with Data Prep
▪ Data science (DS) groups are often the first tenants on Hadoop
▪ Traditional DS tools are weak in data prep
▪ Hadoop tools like Pig are unfamiliar to DS users
▪ Result: 80% of time spent on data wrangling (a typical prep job is sketched below)
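
As a concrete picture of that wrangling, here is a small, hypothetical PySpark sketch (not from the webinar; the file paths and the tab-separated record layout are assumptions) doing the routine prep work data scientists end up owning: parse raw clickstream lines, drop malformed records, and deduplicate before any real analysis starts.

# Hypothetical PySpark data-prep sketch: parse, filter, and deduplicate
# raw clickstream logs before analysis. Paths and record layout are made up.
from pyspark import SparkContext

sc = SparkContext(appName="clickstream-prep")

def parse(line):
    # Assumed raw format: "timestamp<TAB>user_id<TAB>url"
    parts = line.split("\t")
    return tuple(parts) if len(parts) == 3 else None

clean = (sc.textFile("hdfs:///data/raw/clickstream/*.log")
           .map(parse)
           .filter(lambda rec: rec is not None)   # drop malformed lines
           .distinct())                           # deduplicate repeated events

# Persist the prepared records for downstream analytics.
clean.map(lambda rec: "\t".join(rec)).saveAsTextFile("hdfs:///data/prepared/clickstream")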
Demo …
Datameer: Purpose-Built for Hadoop
The #1 Data Discovery Platform
Source: GigaOM, 03/14
MapR Distribution for Hadoop
(Award badges: Big Data Best Product, Business Impact; Hadoop Top Ranked; Production Success)
Look for our follow-up blog post at:
www.mapr.com/blog
The Power of the Open Source Community
MapR Data Platform – Apache Hadoop and OSS ecosystem
Execution engines:
▪ Batch: MapReduce v1 & v2, YARN, Pig, Cascading, Spark, Tez*
▪ Streaming: Spark Streaming, Storm*
▪ SQL: Hive, Impala, Shark, Drill*
▪ NoSQL & search: HBase, Accumulo*, Solr
▪ ML, graph: Mahout, MLlib, GraphX
Data governance and operations:
▪ Workflow & data governance: Oozie, Falcon*
▪ Data integration & access: Sqoop, Flume, HttpFS, Hue
▪ Security: Sentry*, Knox*
▪ Provisioning & coordination: ZooKeeper, Juju, Whirr, Savannah*
▪ Management
* Certification/support planned for 2014
Projects to Follow
▪ Apache Spark – fast, large-scale data processing engine (a minimal sketch follows this list)
– MapR is the only Hadoop distribution to support the entire Spark stack
▪ Apache Drill – fast query execution engine
– MapR-initiated open source project
– Supports instant querying and a broad range of data formats
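
For flavor, here is a minimal PySpark sketch of the kind of fast, large-scale aggregation Spark is built for, reusing the hypothetical prepared clickstream data from the earlier prep sketch (again an illustration, not content from the webinar): count events per user and report the ten most active users.

# Hypothetical PySpark aggregation: top 10 most active users by event count.
from pyspark import SparkContext

sc = SparkContext(appName="top-users")

# Assumed layout from the earlier prep sketch: "timestamp<TAB>user_id<TAB>url"
events = sc.textFile("hdfs:///data/prepared/clickstream")

top_users = (events.map(lambda line: (line.split("\t")[1], 1))  # key by user_id
                   .reduceByKey(lambda a, b: a + b)             # events per user
                   .takeOrdered(10, key=lambda kv: -kv[1]))     # top 10 by count

for user_id, count in top_users:
    print(user_id, count)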
For more information

Learn more:
▪ http://www.datameer.com
▪ http://www.mapr.com

Contact:
▪ @datameer
▪ @MapR
▪ mschumpert@datameer.com
▪ dalekim@mapr.com

#datameer @datameer
