PyATL Meetup, Oct 8, 2015

P r o c e s s i n g B i g D a t a P r e d i c t i v e A n a l y t i c s
PyATL Meetup 10/18/2015

‹#›
W h o A m I
• Roy Russo
• VP Engineering, Predikto

‹#›
W h y A m I H e r e ?
&
(Big Data) Predictive Analytics

‹#›
A g e n d a
• What we do
• Problems we faced
• How we solved them
• Rationale
• What’s good. What’s not.

‹#›
W h o i s P r e d i k t o ?
• Atlanta-based
• Founded in 2012
• Funded
• Paying Customers
• Mechanical Engineers
• Big Data Architects
• Global 1000
… and we don’t suck.

‹#›
W h a t w e d o ?
• Actionable Predictive Maintenance
• Predictive Analytics
• Real-time health scoring
• Unified asset health view
• SaaS

H O W D O E S P R E D I C T I V E
A N A L Y T I C S W O R K ?

‹#›
H o w W e D o I t
TRAIN MGT
SYSTEMS
INTELLITRAIN
GE RM&D
EMD
NYAB LEADER
MOTIVEPOWE
R
WABTEC
EAM
SAP
INFOR
ORACLE
MAXIMO
OTHER
BEACONS
WILD
WEATHER
TCIS
CUSTOM APPS
UMLER
TIME-BASED
ACTIONABLE
PREDICTIONS
PREDIKTO ENTERPRISE PLATFORM
PREDIKTO
INPUT
API’S
DATA
TRANSFORM.
ENGINE
MAX
MACHINE
LEARNING
ENGINE
PREDIKTO
OUTPUT
APIS
&
DASHBOARDS

‹#›
T h e P i p e l i n e
Standard JSON
AutoDynamic Feature
Engineering
AutoDynamic Feature
Selection
ETL
Email
SMS
Integration
UI Data Store
PrediktodataPipeline
“MAX”
OperationalIntegration
Data Aggregation/ETL Machine Learning/Analytics Outbound APIs/Integration

‹#›
H i g h - L e v e l R e q u i r e m e n t s
• ETL on LARGE datasets
• Fast
• Commodity hardware
• Feature scoring/selection on LARGE datasets
• Scale horizontally
• Runs from Instruction-set
• Visualize LARGE datasets
• Time-Series
• Fast
• Commodity hardware
• Support dynamic querying
• Differing Schema
Data Processing
Data Querying

‹#›
W h y S p a r k ?
• ETL:
• Shared Memory
• Not Disk-Bound
• Distributed workloads
• Feature scoring/selection on LARGE datasets
• Same as above
• New node = more capacity
• Spin up. Spin down.
• Runs from Instruction-set
• DAGs
• Python Devs… PySpark

‹#›
I m p l e m e n t i n g S p a r k
• Use DAGS
• Directed Acyclic Graphs
• Config-Driven Workflows
• Tune to Job
• Workers / Core
• Memory Tuning
• CPU or RAM ?
• Cons:
• Steep learning curve
• Dev & Ops
• Documentation
• Exception handling
• Native Scala

‹#›
M o r e o n D A G s
class comma_to_decimal(BaseDagTask):
@staticmethod
def run(sc, config, rdds, log):
from shippable.dag_tasks._utils import safe_map
out = []
for idx, rdd in enumerate(rdds):
if rdd is not None:
cols = [str(x).strip() for x in config['cols'].split(',')]
rdd = safe_map(rdd, lambda x: {k: v if str(k) not in cols else str(v).replace('.', '').replace(',', '.') for k, v in x.items()}, log)
out.append(rdd)
else:
out.append(None)
return out, None
"datetimes_ones": {
"after": "datatypes_ones",
"type": "convert_datetimes",
},
“comma_convert": {
"after": "datetimes_ones",
"type": “comma_to_decimals”,
},
"parentids_ones": {
"after": "devids_ones",
"type": "comma_convert",
},
"timestamps_ones": {
"after": "parentids_ones",
"type": "rename",
},

‹#›
W h y E l a s t i c s e a r c h ?
• Time-Series Data
• Fast-Reads
• Fast writes with Bulk Inserts
• Asynch
• Dynamic querying
• Differing schema
• Everything is indexed
• Visualization
• GeoJSON
• New node = more capacity
• Spark-ES Connector
• REST API & Python lib

‹#›
E S - H a d o o p
client = Elasticsearch(hosts=[es_uri])
query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var +'}}}}}'
es_conf = {
"es.nodes": es_host_name,
"es.port": es_host_port,
"es.resource": es_index + '/' + es_mapping,
"es.query": query
}
es_RDD = sc.newAPIHadoopRDD(
inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf=es_conf)
dicts = es_RDD.map(lambda x: x[1]).map(lambda x: reformat_location(x))

‹#›
I m p l e m e n t i n g E l a s t i c s e a r c h
• Tune
• Memory-Bound
• Large Drives (SSDs)
• Front w/ Load Balancer
• Spark
• Schema:
• Understand your customer’s data!
• Typing is important
• Timestamps, GeoHASH, Floats, etc…
• Cons:
• Moderate learning curve
• Query syntax
• Security

‹#›
A r c h i t e c t u r e

‹#›
I m p l e m e n t i n g E l a s t i c s e a r c h

P R E D I C T | P R E V E N T | P E R F O R M
http://www.predikto.com/company/careers

PyATL Meetup, Oct 8, 2015

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to PyATL Meetup, Oct 8, 2015

Similar to PyATL Meetup, Oct 8, 2015 (20)

More from Roy Russo

More from Roy Russo (7)

Recently uploaded

Recently uploaded (20)

PyATL Meetup, Oct 8, 2015