End User Panel on Real-Time Data Analytics
Building Predictive Applications with
Real-Time Data Pipelines and Streamliner
Eric Frenkiel, CEO and Co-Founder, MemSQL
Going Real-Time is the Next Phase for Big Data
More Devices
More Interconnectivity
More User Demand
…and companies are at risk of being left behind
MemSQL Architecture
Streaming Data Warehouse
Streaming: Integrated streaming with Streamliner
Database: High volume transactions for structured and unstructured data
Data Warehouse: Fast, scalable SQL for immediate analytics
Applications and Technology Trends
Real-Time Analytics | Risk Management | Personalization | Portfolio Tracking | Monitoring and Detection
Internet of Things | Real-Time Data Pipelines | Operationalizing Apache Spark
Put Apache Spark in the fast lane.
Persist. Perform. Perfect.
Changing the Way the World Invests
Noah Zucker, Vice President – Tactical Engineering, Novus Partners
Scalable Portfolio Intelligence with MemSQL
 100+ Investment Managers, $2 Trillion AUM
 Research Platform: 10,000+ Institutions
 Founded 2007, Privately Held
We help investors discover their true
investment acumen and risk
About Novus
True Investment Acumen and Risk…at Scale
Top-Tier Client List
 24/7 ETL Handholding
 Overnight Failure =
Business Hours Slowdown
 Scala worker pool limited
by the database
 Non-trivial code changes
needed to shard and scale
Before MemSQL…
Today’s Portfolio Intelligence…Right Now
Before MemSQL: 90 Min.
With MemSQL: 2 Min.
[Diagram labels: Customer Data; Persistent Store; ETL; Analytics (Scala)]
First-Class JSON Support…Happy Developers
memsql> select * from tasks t where t.task::uid::%clientId = 7;
+---------+---------------------------------------------------------------+
| task_id | task |
+---------+---------------------------------------------------------------+
| 3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"noahlz"} |
+---------+---------------------------------------------------------------+
1 row in set (0.00 sec)
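To make the path expression in the query above concrete, here is a minimal Python sketch of the same filter applied to in-memory rows. It is only an illustration of what `task::uid::%clientId = 7` selects, not how MemSQL evaluates it. The helper `json_path` and the second row (task_id 4) are hypothetical and added for the example.

```python
import json

# Row 3 is the row shown on the slide; row 4 is a made-up non-matching row.
tasks = [
    (3, '{"uid":{"clientId":7,"id":1009,"which":"P"},"user":"noahlz"}'),
    (4, '{"uid":{"clientId":8,"id":1010,"which":"P"},"user":"other"}'),
]


def json_path(doc, *keys):
    """Walk nested dict keys; return None if any key along the path is missing."""
    for key in keys:
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


# Equivalent in spirit to: where t.task::uid::%clientId = 7
matches = [
    task_id
    for task_id, task in tasks
    if json_path(json.loads(task), "uid", "clientId") == 7
]
```
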
Salat
With MemSQL…
• Client team focuses on service, not ETL
• Predictable application performance
• Scala workers: 12 → 126
• Add servers to scale – No code changes needed
http://www.novus.com
http://tech.novus.com
@NovusCode
Ian Hansen, Software Engineering Manager
Digital Ocean
ETL Tools for Small Teams
Problem: Business Intelligence Slows as We Grow
• Data lives in SQL
• Easy to ask new questions in SQL
• But… Business Intelligence tasks taking longer
• Database isn’t built for quick aggregations
Solution: Scale-out SQL Database
• SQL team stays powerful
• Quick to iterate with quick answers
• Prepare for the future!
Problem: Data isn’t in MemSQL
Plus
• You don’t have an engineer on your team
• It’s hard to get an engineer’s time
• You’ve got a job to do… (which is taking more and more time)
Solution: ETL Using REPLACE INTO
• MySQL SQL flavor (available in MemSQL)
• Handles new rows and updates to existing rows
• Easy to write
  • Query source database, then replace into target database
• Many other scale-out SQL databases don’t have an equivalent
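The "query source, then replace into target" pattern can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: SQLite (which also accepts `REPLACE INTO`) stands in for both the source database and MemSQL, and the `users` table, its columns, and the `sync_users` helper are hypothetical.

```python
import sqlite3

# SQLite stands in for both the source database and the MemSQL target.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
target.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")


def sync_users(src, dst):
    """Copy every source row into the target, overwriting rows with the same key."""
    rows = src.execute("SELECT id, email FROM users").fetchall()
    dst.executemany("REPLACE INTO users (id, email) VALUES (?, ?)", rows)
    dst.commit()
    return len(rows)


source.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com")],
)
sync_users(source, target)

# A later run picks up both an updated row and a brand-new row.
source.execute("UPDATE users SET email = 'a2@example.com' WHERE id = 1")
source.execute("INSERT INTO users VALUES (3, 'c@example.com')")
sync_users(source, target)

synced = target.execute("SELECT id, email FROM users ORDER BY id").fetchall()
```

Because the statement replaces on the primary key, the same job can be re-run safely; it does not need to know which rows are new and which have changed.
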
Problem: Now Load JSON Event Data
• ~300K events per day
• Many different types of JSON events
Solution: MemSQL Loader + JSON Type
• Only loads new files (or files whose content has changed)
• Parallelizes the process
• Transformation script is simple: return id and raw JSON data
• SQL team unaffected by new JSON events
./memsql-loader load /opt/events/** \
  --table events \
  --script=/opt/events-etl \
  --file-id-column file_id \
  --columns id,data
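A transformation script of the kind described ("return id and raw JSON data") could look roughly like the sketch below. The slides do not show the contents of `/opt/events-etl`, so everything here is an assumption: that each input line is one JSON event, that every event has an `id` field, and that rows are emitted as tab-separated `id` and raw JSON.

```python
import json


def transform(lines):
    """Yield one tab-separated 'id<TAB>raw json' row per non-empty event line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumes each event carries an "id" field (hypothetical field name).
        yield f"{event['id']}\t{line}"


sample = [
    '{"id": 1, "type": "signup"}',
    "",
    '{"id": 2, "type": "login", "ok": true}',
]
rows = list(transform(sample))
```

The script never inspects the event type, which is why new kinds of JSON events need no changes on the loading side. If the loader pipes file contents through standard input and reads rows from standard output (an assumption about how `--script` is invoked), the same function can be driven with `sys.stdout.writelines(r + "\n" for r in transform(sys.stdin))`.
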
Problem: Processing Data on Select
• Need computed value in SQL query
• Computing the value slows down queries
• Computed value used on many queries
  • e.g. domain from a URL string
Solution: Persistent Columns
• Pre-compute result and save it on the row
• Automatically updated if row changes
• No need to alter ETL pipeline
ALTER TABLE events
ADD COLUMN (
  referring_domain AS
    substring_index(
      substring(data::$referrer, (locate('//', data::$referrer)) + 2),
      '/', 1)
    PERSISTED varchar(255)
)
Solution: Persistent Columns
Use pre-computed value in select
memsql> select data, referring_domain from events limit 2;
+-------------------------------------+------------------+
| data | referring_domain |
+-------------------------------------+------------------+
| {"referrer":"http://example.com/b"} | example.com |
| {"referrer":"http://example.com/a"} | example.com |
+-------------------------------------+------------------+
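The string logic behind `referring_domain` can be traced step by step in Python. This sketch only mirrors the `substring_index(substring(..., locate('//', ...) + 2), '/', 1)` expression from the slides so the result above is easy to verify by hand; the function name is hypothetical and this is not how MemSQL computes the column.

```python
def referring_domain(url):
    """Mimic substring_index(substring(url, locate('//', url) + 2), '/', 1)."""
    # locate() is 1-based and returns 0 when '//' is absent; str.find is 0-based.
    pos = url.find("//") + 1
    # substring(url, n) keeps everything from 1-based position n onward.
    rest = url[pos + 2 - 1:]
    # substring_index(rest, '/', 1) keeps the text before the first '/'.
    return rest.split("/", 1)[0]


domains = [
    referring_domain(u)
    for u in ("http://example.com/b", "http://example.com/a")
]
```
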
Tools
• REPLACE INTO syntax
• JSON native type
• MemSQL Loader
• Persistent columns
• Now, MemSQL Streamliner
We Want More Data
We are Hiring
Mike DePrizio, Senior Architect, Akamai Technologies
Unlocking Revenue with In-Memory Technology
We are the leading provider of cloud services for delivering, optimizing and securing online content and business applications
CORPORATE STATS (2014):
• Revenue: $1.96B
• Locations: 1,300
• Customers: 5,000+
• Employees: 5,100+
OUR HISTORY:
Founded 1998 and rooted in MIT technology: solving Internet congestion with math, not hardware
The Business of Billing
Billing domino effect
• Akamai → Customers → Sub-customers
Daily billing requires:
• Fast data delivery
• Accurate data
Old Model: Generating a bill at end of month for customer services
New Model: Generating a bill at the end of every day for sub-customer services
Current Billing Data Management
Gather logs from 190,000+ servers in 1400 locations in 110 countries
• Multiple PBs/day aggregate/reduce into relevant billing data feed
• Typical data record: 3 key fields plus metrics
• Load resulting data record into our RDBMS system
Greatest Challenges
• Current system cannot handle expected throughput
• Difficult to quickly scale up existing environments
• New model will generate 10x+ data
Deploying MemSQL
Application: Daily sub-customer billing
Problem: Existing RDBMS pipeline loads were maxed out at 150-300K upserts/second and could not keep up with the projected size of the new billing model
Results: MemSQL cluster performs at 1.9 million upserts/second, allowing the transition from monthly to daily billing
[Diagram labels: Billing Data (resource usage statistics); INSERT... ON DUPLICATE KEY UPDATE... (1.9 million/sec); Billing Application]
Billing Application
• Compute sub-customer charges daily
• Roll up sub-customer usage by customer/cloud provider
• More sophisticated platform offers customers better service, partners new business opportunities
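The upsert pattern named above (insert a usage record, or add to the existing one when the key already exists) can be illustrated with a small runnable sketch. It rests on several assumptions: SQLite stands in for MemSQL, so the statement uses SQLite's `ON CONFLICT ... DO UPDATE` (SQLite 3.24 or later) rather than `INSERT ... ON DUPLICATE KEY UPDATE`; and the table, the three key fields, and the single metric are hypothetical names chosen to match the "3 key fields plus metrics" record shape.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE usage_stats (
        customer_id INTEGER,
        sub_customer_id INTEGER,
        usage_day TEXT,
        bytes_served INTEGER,
        PRIMARY KEY (customer_id, sub_customer_id, usage_day)
    )
    """
)

# SQLite spelling of an upsert; MemSQL/MySQL would use ON DUPLICATE KEY UPDATE.
UPSERT = """
    INSERT INTO usage_stats (customer_id, sub_customer_id, usage_day, bytes_served)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (customer_id, sub_customer_id, usage_day)
    DO UPDATE SET bytes_served = bytes_served + excluded.bytes_served
"""

records = [
    (1, 10, "2015-09-30", 500),
    (1, 10, "2015-09-30", 250),  # same key: added to the existing row
    (1, 11, "2015-09-30", 100),
]
db.executemany(UPSERT, records)

per_sub_customer = db.execute(
    "SELECT sub_customer_id, bytes_served FROM usage_stats ORDER BY sub_customer_id"
).fetchall()

# Roll up sub-customer usage by customer.
per_customer = db.execute(
    "SELECT customer_id, SUM(bytes_served) FROM usage_stats GROUP BY customer_id"
).fetchall()
```
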
Results Speak for Themselves
• 2M upserts/second on AWS EC2 instances
• Scalability on commodity hardware
• Meeting our billing windows
• Unlocking revenue
What Next?
• Adapt PoC for real-world situations
• Continue scaling linearly
• Optimize results with small cluster deployment
Eric Frenkiel, MemSQL CEO and co-founder
September 30, 2015 • New York, NY
Introducing MemSQL Streamliner
Introducing the MemSQL Streamliner
• One click deployment of integrated Apache Spark
• Put Spark in the Fast Lane
  • GUI pipeline setup
  • Multiple data pipelines
  • Real-time transformation
• Eliminates batch ETL
• Open source on GitHub
Simple Deployment Process
1. Deploy MemSQL Cluster (In-Memory | Distributed | Relational)
2. Deploy Spark Cluster
Kafka Connects to Each Node
[Diagram labels on each step: Application; Cluster]
Streamliner Architecture
First of many integrated Apache Spark solutions
[Diagram labels: Other Real-Time Data Sources; Application; Apache Spark; STREAMLINER; Future Solution; Future Machine Learning Solution]
Streamliner ETL Detail
[Diagram: same labels as above, with the STREAMLINER block expanded into Extract, Transform, and Load stages and boxes labeled Custom, Future Extractor, JSON, Custom, and Future Transformer]
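The Extract, Transform, Load staging named in this diagram can be sketched as three small composable functions. This is plain Python for illustration only: Streamliner itself runs on Apache Spark, and none of the names below (`extract`, `transform`, `load`, `run_pipeline`, the `sensor`/`value` fields) come from its API. A list stands in for the target table.

```python
import json


def extract(batches):
    """Extract: flatten incoming batches of raw messages into single records."""
    for batch in batches:
        yield from batch


def transform(raw):
    """Transform: parse one JSON message into a (sensor, value) row, or None if malformed."""
    try:
        event = json.loads(raw)
        return (event["sensor"], float(event["value"]))
    except (ValueError, KeyError, TypeError):
        return None


def load(rows, table):
    """Load: append rows to the target (a list standing in for a database table)."""
    table.extend(rows)
    return len(rows)


def run_pipeline(batches, table):
    """Chain the three stages, dropping records the transformer rejects."""
    rows = [row for row in map(transform, extract(batches)) if row is not None]
    return load(rows, table)


table = []
batches = [
    ['{"sensor": "S1", "value": 0.5}', "not json"],
    ['{"sensor": "S2", "value": 2}'],
]
loaded = run_pipeline(batches, table)
```

Keeping each stage behind its own function is what makes a custom extractor or transformer swappable without touching the other stages.
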
Building Predictive Applications
Scoring Real-Time Data with Predictive Models
[Diagram labels: Industrial Equipment Sensor Data (S1, S2, S3); Streamliner Input; User Jar; SAS Generated PMML; predictive models (P1, P2, P3); Sensor 1; Predictive Model 1]
Streamliner Benefits
• Build end-to-end data pipelines in minutes
• Reduce data latency from days or hours to ZERO
• Support thousands of concurrent users running real-time queries
• Give users immediate access to fresh data via innovative applications
THE GAME
See MemSQL Streamliner in Action at Booth #831

Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

