The Data Science Lab:
Enabling Flexible,
Complex Analytics
on a Single Platform
Follow the conversation on Twitter: @Kognitio #DataSci
• Thank you for joining today’s session!
• The web briefing will start momentarily.
Slides available NOW at www.slideshare.net/kognitio
Teleconference:
Use your computer, or call:
US +1 631 267 4890
Toll-Free 1-855-299-5224
Passcode: 841 203 797
Other global Dial-in numbers available at:
https://kognitio.webex.com/kognitio/globalcallin.php
- Web Briefing -
Today’s call will use the
WebEx Q & A feature
The Data Science Lab: Enabling Flexibility
Demonstrations
Summary, Question & Answer Session
Presenters: 
‐ Dr. Sharon Kirkham, Data Scientist
‐ Michael Hiskey, Product Evangelist
July 25, 2013
The Data Science Lab
Use Case Scenarios:
1. Data Accessibility
• Hadoop
• Data Mash‐Up
2. Analytical Productivity
• MPP in‐memory code execution
• R scripts with MPP
3. “Graduate” Projects to B.A.U.
• Data Science and the Business
POLL
Flexible Platform for Big Data Analytics
Flexible data access
Flexible processing
Flexible deployment options
[Diagram labels: Reporting (All BI Tools, All OLAP Clients, Excel); Analytical Platform Layer; Near-line Storage (optional); Hadoop Clusters; Enterprise Data Warehouses; Legacy Systems; Kognitio Storage; Cloud Storage]
Mature Business Intelligence & Reporting
Numbers, tables, charts, indicators
…accessed with ease and simplicity
Historical information, latency
BI tools have plateaued
Decision Support
Advanced analytics and data science
More math…a lot more math
The Analytical Enterprise
Business Analyst
Systems Admin
Data Scientist
Sexiest job of the 21st Century?
Key: “Graduation”
• Projects will need to easily Graduate
from the Data Science Lab and
become part of Business as Usual
Data scientists are in demand:
Telling a story with data
Build, tune and run complex data projects
Dealing with big data from multiple sources
Must overcome IT bottlenecks
Source: http://www.emc.com/microsites/bigdata/infographic.htm
Scenario 1: Data Accessibility
“… this exercise is to identify if improvements in data preparation can make a significant difference to the productivity and earning capacity of our analytics team”
- Global digital marketing analytics firm
source: http://newvantage.com/wp-content/uploads/2012/12/NVP-Big-Data-Survey-Themes-Trends.pdf
POLL
Scenario 1: Data Accessibility
SQL querying on Hadoop
Summary: Data Accessibility
Kognitio Hadoop Integration
• Map/Reduce agent dynamically executes on all Hadoop nodes
• Query passes selections and relevant predicates to the agents
• Data filtering & projection happen locally on each node
• Data is filtered as it is read from file(s)
• Only data of interest is transferred and loaded into memory via parallel load streams
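The push-down idea in the bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kognitio's agent: `NODE_FILES`, `node_agent` and `pushdown_query` are hypothetical names, in-memory lists stand in for files on Hadoop nodes, and a thread pool stands in for the parallel load streams.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for file blocks held on three Hadoop nodes.
NODE_FILES = {
    "node1": [{"id": 1, "region": "EU", "spend": 120, "notes": "a"},
              {"id": 2, "region": "US", "spend": 80, "notes": "b"}],
    "node2": [{"id": 3, "region": "EU", "spend": 45, "notes": "c"}],
    "node3": [{"id": 4, "region": "US", "spend": 300, "notes": "d"},
              {"id": 5, "region": "EU", "spend": 210, "notes": "e"}],
}


def node_agent(rows, predicate, columns):
    """Runs "on the node": filter rows as they are read and keep only the
    requested columns, so unwanted data never leaves the node."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def pushdown_query(predicate, columns):
    """Pass the predicate and column list to every node, then gather the
    already-reduced results (one parallel stream per node)."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(node_agent, rows, predicate, columns)
                   for rows in NODE_FILES.values()]
        loaded = []
        for future in futures:
            loaded.extend(future.result())
    return loaded


# Roughly: select id, spend from ... where region = 'EU'
result = pushdown_query(lambda r: r["region"] == "EU", ["id", "spend"])
```

The point of the design is that filtering and projection happen before any transfer, so the in-memory layer only ever receives the rows and columns the query asked for.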
Scenario 2: Analytical Productivity
“…want to see a significant improvement in the analytical throughput … from current time frame of 2 weeks … to no more than 1 day”
- A marketing science analytics company
“…we run much of our analytics on a 5% sample of the data. We want to be able to run on 100% of the data in the same time as the 5% sample.”
- A leading ad agency
Source: http://www.wired.com/insights/2013/07/the-new-horizon-for-bi-and-analytics/
POLL
Scenario 2: Analytical Productivity
Massively parallel in-memory code execution
MPP in-memory code execution
NoSQL external scripting function:
• SQL provides standard data access framework
– Open, adaptable framework; pass data to/from any executable or interpreter
– Fully flexible MPP execution of R, Python, Java, text parsing libraries etc.
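-- Register Perl as an external script interpreter that exchanges CSV rows.
create interpreter perlinterp
command '/usr/bin/perl' sends 'csv' receives 'csv' ;

-- Split each comment into words in Perl, then count the words in SQL.
select top 1000 words, count(*)
from (external script using environment perlinterp
      receives (txt varchar(32000))
      sends (words varchar(100))
      script S'endofperl(
        # Read one comment per input line.
        while(<>)
        {
          chomp();
          s/[,.!_]//g;    # strip basic punctuation
          foreach $c (split(/ /))
          {
            # Emit purely alphabetic tokens, one word per output row.
            if($c =~ /^[a-zA-Z]+$/) { print "$c\n" }
          }
        }
      )endofperl'
      from (select comments from customer_enquiry))dt
group by 1
order by 2 desc;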
From the Demo:
This reads the long comments text from the customer enquiry table. Inline Perl converts each long text value into an output stream of words (one word per row), and the query selects the top 1000 words by frequency using standard SQL aggregation.
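Since the slide lists Python among the languages that can run as external scripts, the same word-splitting and counting logic can be sketched in plain Python. This is a stand-alone illustration, not Kognitio code: `split_words`, `top_words` and the sample comments are hypothetical, and the counting that the demo does with SQL `group by` is done here with `collections.Counter`.

```python
import re
from collections import Counter


def split_words(comment):
    """Mirror the Perl loop: strip , . ! _ then split on single spaces
    and keep only purely alphabetic tokens (case-sensitive)."""
    cleaned = re.sub(r"[,.!_]", "", comment.rstrip("\n"))
    return [w for w in cleaned.split(" ") if re.fullmatch(r"[a-zA-Z]+", w)]


def top_words(comments, n=1000):
    """Count every emitted word and return the n most frequent,
    like `select top n ... group by 1 order by 2 desc`."""
    counts = Counter()
    for comment in comments:
        counts.update(split_words(comment))
    return counts.most_common(n)


# Hypothetical sample standing in for the customer_enquiry comments column.
sample = ["slow delivery, slow refund!", "refund took 3 days."]
top_two = top_words(sample, n=2)
```

Like the Perl version, this drops tokens containing digits or other characters (the "3" above) and treats "Slow" and "slow" as different words.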
Scenario #3: Barriers to Deployment
Accessing Analytics across the business
An Ideal Deployment Scenario
Cloud model can provide a way to quickly model, experiment, develop and build
• Deploy to existing reporting tools
• Pass ownership to IT
• Cloud instances can be “temporary”
• Repeatable framework
[Image: sample spreadsheet report of monthly figures (Jul., Aug., Sep.) for 2011 and 2010, not adjusted, with 9-month totals]
Business Analyst
Business User
IT Admin
Data Scientist
PRESS HERE…and really cool Big Data stuff happens!
It’s all about flexibility
Flexible data access
Flexible processing
Flexible deployment options
Question & Answer session will be conducted electronically,
using the panel to the right of your screen
Learn more, stay connected:
Free Download: kognitio.com/GoTryIt
Request a Meeting: kognitio.com/meeting
Take the Survey: kognitio.com/DSL
