node-crate: node.js & big data
by Stefan Thies
The path is the goal
About me
• 2000-2013: www.verint.com (Dev Team Lead, Product Management, Sales Engineer)
• Since 2013: Consulting / Outsourcing, bigdata-analyst.de
• Just started …: DevOps Evangelist @ www.sematext.com
• Follow me @seti321
Product evaluations
[Bar chart: document-oriented data stores (MarkLogic, MongoDB, Elasticsearch, CouchDB, CRATE) on a scale from 0 to 90. Points for the product evaluation criteria of the specific project (RT, scalability, replication, features and commercial)]
How did I get here?
• 2012-2014 Systems with Elasticsearch &
• Mobile Apps (Geo) with Appcelerator Titanium
• Data enrichment & Webcrawlers (whois, geo, appstores)
• Distributed Regex-Processing for CyberSec with 0MQ
• Security Layer around Elasticsearch (sails.js)
• … we did almost everything in NodeJS
Design criteria
• Scalable & lean architecture
• Operations: NO zoo of 3rd-party components
• We chose Elasticsearch at that time
• Automatic installation, Docker
• One Language: JavaScript / Node.js
Security & Admin
- Policies, Users, Roles
- REST API
- Websockets / RT
"Data enrichment"
• Hey, we've got Elasticsearch, so lookup queries against 'static' data sources will be fast!
• Distributed processing based on 0MQ (push/pull): high throughput, parallel processing, distributed worker processes
[Diagram: collection -> parallel workers, each doing information extraction and processing -> data lookups in Elasticsearch]
Any problem?
[Diagram: collect mass data -> Elasticsearch -> Analyze & Visualize; other data sources (Geo, Company data, Open Source Information) -> processing queue / workers -> massive updates! Also marked: Reporting (PDF); Accurate Counts (Facets) -> Aggregation]
OPS issues
Alternative: 'any' DB (for updates) + ES
• Compatibility, maintenance and monitoring of all components become a big mess: each box can be multiple machines, the River might not be updated for the latest DB or ES version, and a bug might force you to upgrade one of the components, which is where the dependency trouble starts …
• Reporting: custom-programmed DSL queries and rendering HTML to PDF with PhantomJS are painful if you know the standard report generators from the SQL world. How do you tell the customer to adapt a report to his needs? Using some 'standard' DB (SQL or NoSQL) supported by the reporting tools would solve it.
[Diagram: DB Vx.x -> DB-River Vy.y -> Elasticsearch Vz.z, with Data-Processing Services and Search & Analytics Vb.b attached]
Don’t panic
google like …
A match at Slideshare!
• An early presentation of Crate from Jodok got my attention
• http://www.crate.io
• The Mountain Hackathon 2014: birthday of node-crate
Package status
• Igor Likhamanov
• Stefan Thies
• Martin Heidegger joined recently and made high-quality, professional improvements!
DevOps: Stack-Shrinking
• From 3 down to 1 storage service:
[Diagram, before: DB Vx.x + DB-River Vy.y + Elasticsearch Vz.z, used by Data-Processing and Search & Analytics Vb.b. After: Crate Va.a, used by Data-Processing and Search & Analytics Vb.b]
Data Enrichment Performance
• Elasticsearch has no "update by query"
• Updating e.g. 50,000 records means running a query to identify the relevant records, then either sending 50,000 HTTP update requests or building one large bulk update request with 50,000 instructions -> overhead! -> K.O.
• In Crate
• update something where something_else = 'other_value'
• ONE command: still a heavy operation because of Lucene delete/index, BUT without '50,000 commands/network round trips' on top …
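A minimal sketch of the difference in request size, assuming the Elasticsearch 1.x-style bulk format (one action line plus one partial-document line per record) and a single parameterized SQL statement for Crate. The index, type, table and column names, the values and both helper functions are hypothetical and only serve the illustration.

```javascript
// Sketch only: index, type, table and column names are hypothetical.

// Elasticsearch route: after a query has returned the matching ids,
// every document needs its own action line plus a partial-document line
// in the newline-delimited bulk body.
function buildBulkUpdateBody(index, type, ids, partialDoc) {
  const lines = [];
  for (const id of ids) {
    lines.push(JSON.stringify({ update: { _index: index, _type: type, _id: id } }));
    lines.push(JSON.stringify({ doc: partialDoc }));
  }
  // The bulk body ends with a trailing newline.
  return lines.join('\n') + '\n';
}

// Crate route: one parameterized SQL statement, independent of the
// number of matching rows.
function buildCrateUpdate(table, column, newValue, whereColumn, whereValue) {
  return {
    stmt: `update ${table} set ${column} = ? where ${whereColumn} = ?`,
    args: [newValue, whereValue],
  };
}

const ids = Array.from({ length: 50000 }, (_, i) => `doc-${i}`);
const bulkBody = buildBulkUpdateBody('enrichment', 'record', ids, { country: 'DE' });
const crateUpdate = buildCrateUpdate('enrichment', 'country', 'DE', 'ip_range', '192.0.2.0/24');

console.log('bulk body lines:', bulkBody.trim().split('\n').length);
console.log('crate statement:', crateUpdate.stmt, crateUpdate.args);
```

The bulk body grows with the number of matching records, while the SQL statement stays the same size; the Lucene delete/index work on the server remains in both cases.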

Data Enrichment - performance
[Diagram: collect mass data -> CRATE data store -> Analyze & Visualize; other data sources (Geo …, Open Source Information) -> processing queue / workers -> massive updates, no issue :) Reporting (PDF) using CRATE JDBC]
BLOBs (images, videos, packet data, …)
• Traditionally
• Metadata in a DB + files in some filesystem / separate object storage
• Both behave differently when scaling
• Crate stores BLOBs like other shards, including replicas
• More nodes, more capacity, replicas etc.
• BLOB storage scales with the data store
• Would be perfect for a 'dropbox'-like service :) or any archived data
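As a rough sketch of how a client could address such a BLOB: Crate keeps BLOBs in dedicated blob tables and, as far as its documentation describes, exposes them over HTTP under `/_blobs/<table>/<sha1 of the content>`. The table name `myblobs`, the host and port, and the helper functions below are assumptions for illustration.

```javascript
const crypto = require('crypto');

// DDL for a blob table, shown as a string only (hypothetical table name).
const createBlobTable =
  'create blob table myblobs clustered into 3 shards with (number_of_replicas = 1)';

// A blob is addressed by the SHA-1 digest of its content.
function blobDigest(content) {
  return crypto.createHash('sha1').update(content).digest('hex');
}

// Build the HTTP location of a blob. Host, port and table name are
// placeholders for illustration.
function blobUrl(baseUrl, blobTable, content) {
  return `${baseUrl}/_blobs/${blobTable}/${blobDigest(content)}`;
}

const image = Buffer.from('pretend this is image data');
const url = blobUrl('http://localhost:4200', 'myblobs', image);

console.log(createBlobTable);
console.log('PUT', url);
```

Because the address is derived from the content, uploading the same file twice leads to the same location, which suits archive-style storage.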
Demo: Installation, usage, examples walk-through …
• https://www.youtube.com/watch?v=ZaDFrd4ZwQk (setup)
• https://github.com/megastef/node-crate (node-crate on github)
• http://techblog.bigdata-analyst.de (sample applications)
• https://crate.io/docs/stable/ (documentation of CRATE.IO)
Simple Example
Import Data (bulk insert)
COPY web_log FROM '/var/logs/web_log.json' WITH (bulk_size=15000, concurrency=2)
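COPY FROM reads files that contain one JSON object per line. The sketch below writes such a file from Node.js and builds the matching statement; the records, the temporary file location and the `toJsonLines` helper are made up for illustration, and the file must of course be readable by the Crate nodes.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Serialize records as newline-delimited JSON, one object per line.
function toJsonLines(records) {
  return records.map((r) => JSON.stringify(r)).join('\n') + '\n';
}

// Hypothetical sample rows for the web_log table.
const records = [
  { ts: 1425340800000, host: '192.0.2.10', useragent: 'Safari' },
  { ts: 1425340801000, host: '192.0.2.11', useragent: 'Firefox' },
];

// Write to a temporary file (placeholder location).
const file = path.join(os.tmpdir(), 'web_log.json');
fs.writeFileSync(file, toJsonLines(records));

const copyStatement =
  `COPY web_log FROM '${file}' WITH (bulk_size=15000, concurrency=2)`;
console.log(copyStatement);
```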
create table web_log (ts timestamp, host string, …);
Special data types for
- IP
- Geo Shapes
- Objects (dynamic)
insert into web_log (ts, useragent, …) values (132323, 'Safari', …)
select
update
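The create / insert / select / update steps could be driven from Node.js roughly as follows. This is a sketch against any client object that offers `execute(sql, args)` and returns a promise; node-crate's `execute` is assumed to fit that shape after `crate.connect(host, port)`. The column list is cut down to three columns, and the values and the stand-in client are hypothetical.

```javascript
// Statements from the slides, completed with hypothetical columns and values.
function webLogStatements() {
  return [
    { sql: 'create table web_log (ts timestamp, host string, useragent string)', args: [] },
    { sql: 'insert into web_log (ts, host, useragent) values (?, ?, ?)',
      args: [132323, '192.0.2.10', 'Safari'] },
    // Make the inserted row visible to the following select.
    { sql: 'refresh table web_log', args: [] },
    { sql: 'select host, useragent from web_log where useragent = ?', args: ['Safari'] },
    { sql: 'update web_log set useragent = ? where host = ?', args: ['Safari 8', '192.0.2.10'] },
  ];
}

// Runs the statements one after another against any client exposing
// execute(sql, args) that returns a promise.
async function runAll(db, statements) {
  const results = [];
  for (const { sql, args } of statements) {
    results.push(await db.execute(sql, args));
  }
  return results;
}

// Stand-in client so the sketch runs without a Crate server.
const recordingClient = {
  log: [],
  execute(sql, args) {
    this.log.push({ sql, args });
    return Promise.resolve({ rowcount: 0 });
  },
};

runAll(recordingClient, webLogStatements())
  .then(() => console.log(recordingClient.log.length, 'statements sent'));
```

Passing values as arguments instead of concatenating them into the SQL string keeps quoting problems out of the statements.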
Anything missing?
• „Kibana“
• see my blog for how to add it ('officially' not supported)
• Performance monitoring
• see next section …
Using Kibana with Crate
Performance Monitoring
Setup & Run
If you can't measure it, you can't fix it!
Monitoring - Sematext SPM supported applications
NEW: release status for CRATE/SPM monitor: Prototype
Please contact me on request
SPM Monitoring
My NPM Modules
• node-crate - DB driver for Crate for Node.js. Help with the 'Waterline/sails.js' ORM is appreciated! We are open to other suggestions; we like the sails.js WebSocket capability and security features (policies) and would get those 'for free'
• winston-crate - logger transport for Crate using node-crate
• bro-ids - simple interface to the Bro intrusion detection system (IP monitoring)
+ Sematext-related work
• node-red-contrib-logsene - Node-RED (IoT, MQTT, …) logger for Logsene
• node-spm - Custom Metrics & Logging API for http://www.sematext.com adapted for Node.js
• spmagent - Performance Monitoring for Node.js
• Garbage Collection, Event Loop Monitor, HTTP Metrics, Cluster mode, …
• Release: Very soon! - Feb 2015 - stefan.thies@sematext.com for early access
Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig Logging?
Dig working with open source?
We're hiring planet-wide!
http://www.sematext.com/about/jobs.html
Thank you for your attention.
03.03.15 DevOps Frankfurt "Metrics & more …"
