1. P r o c e s s i n g B i g D a t a P r e d i c t i v e A n a l y t i c s
PyATL Meetup 10/18/2015
2. ‹#›
W h o A m I
• Roy Russo
• VP Engineering, Predikto
3. ‹#›
W h y A m I H e r e ?
&
(Big Data) Predictive Analytics
4. ‹#›
A g e n d a
• What we do
• Problems we faced
• How we solved them
• Rationale
• What’s good. What’s not.
5. ‹#›
W h o i s P r e d i k t o ?
• Atlanta-based
• Founded in 2012
• Funded
• Paying Customers
• Mechanical Engineers
• Big Data Architects
• Global 1000
… and we don’t suck.
6. ‹#›
W h a t w e d o ?
• Actionable Predictive Maintenance
• Predictive Analytics
• Real-time health scoring
• Unified asset health view
• SaaS
7. H O W D O E S P R E D I C T I V E
A N A L Y T I C S W O R K ?
9. ‹#›
H o w W e D o I t
TRAIN MGT
SYSTEMS
INTELLITRAIN
GE RM&D
EMD
NYAB LEADER
MOTIVEPOWE
R
WABTEC
EAM
SAP
INFOR
ORACLE
MAXIMO
OTHER
BEACONS
WILD
WEATHER
TCIS
CUSTOM APPS
UMLER
TIME-BASED
ACTIONABLE
PREDICTIONS
PREDIKTO ENTERPRISE PLATFORM
PREDIKTO
INPUT
API’S
DATA
TRANSFORM.
ENGINE
MAX
MACHINE
LEARNING
ENGINE
PREDIKTO
OUTPUT
APIS
&
DASHBOARDS
10. ‹#›
T h e P i p e l i n e
Standard JSON
AutoDynamic Feature
Engineering
AutoDynamic Feature
Selection
ETL
Email
SMS
Integration
UI Data Store
PrediktodataPipeline
“MAX”
OperationalIntegration
Data Aggregation/ETL Machine Learning/Analytics Outbound APIs/Integration
11. ‹#›
H i g h - L e v e l R e q u i r e m e n t s
• ETL on LARGE datasets
• Fast
• Commodity hardware
• Feature scoring/selection on LARGE datasets
• Scale horizontally
• Runs from Instruction-set
• Visualize LARGE datasets
• Time-Series
• Fast
• Commodity hardware
• Scale horizontally
• Support dynamic querying
• Differing Schema
Data Processing
Data Querying
12. ‹#›
W h y S p a r k ?
• ETL:
• Shared Memory
• Not Disk-Bound
• Distributed workloads
• Feature scoring/selection on LARGE datasets
• Same as above
• Scale horizontally
• New node = more capacity
• Spin up. Spin down.
• Runs from Instruction-set
• DAGs
• Python Devs… PySpark
13. ‹#›
I m p l e m e n t i n g S p a r k
• Use DAGS
• Directed Acyclic Graphs
• Config-Driven Workflows
• Tune to Job
• Workers / Core
• Memory Tuning
• CPU or RAM ?
• Cons:
• Steep learning curve
• Dev & Ops
• Documentation
• Exception handling
• Native Scala
15. ‹#›
M o r e o n D A G s
class comma_to_decimal(BaseDagTask):
@staticmethod
def run(sc, config, rdds, log):
from shippable.dag_tasks._utils import safe_map
out = []
for idx, rdd in enumerate(rdds):
if rdd is not None:
cols = [str(x).strip() for x in config['cols'].split(',')]
rdd = safe_map(rdd, lambda x: {k: v if str(k) not in cols else str(v).replace('.', '').replace(',', '.') for k, v in x.items()}, log)
out.append(rdd)
else:
out.append(None)
return out, None
"datetimes_ones": {
"after": "datatypes_ones",
"type": "convert_datetimes",
},
“comma_convert": {
"after": "datetimes_ones",
"type": “comma_to_decimals”,
},
"parentids_ones": {
"after": "devids_ones",
"type": "comma_convert",
},
"timestamps_ones": {
"after": "parentids_ones",
"type": "rename",
},
16. ‹#›
W h y E l a s t i c s e a r c h ?
• Time-Series Data
• Fast-Reads
• Fast writes with Bulk Inserts
• Asynch
• Dynamic querying
• Differing schema
• Everything is indexed
• Visualization
• GeoJSON
• Scale horizontally
• New node = more capacity
• Spark-ES Connector
• REST API & Python lib
17. ‹#›
E S - H a d o o p
client = Elasticsearch(hosts=[es_uri])
query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var +'}}}}}'
es_conf = {
"es.nodes": es_host_name,
"es.port": es_host_port,
"es.resource": es_index + '/' + es_mapping,
"es.query": query
}
es_RDD = sc.newAPIHadoopRDD(
inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf=es_conf)
dicts = es_RDD.map(lambda x: x[1]).map(lambda x: reformat_location(x))
18. ‹#›
I m p l e m e n t i n g E l a s t i c s e a r c h
• Tune
• Memory-Bound
• Large Drives (SSDs)
• Front w/ Load Balancer
• Spark
• Schema:
• Understand your customer’s data!
• Typing is important
• Timestamps, GeoHASH, Floats, etc…
• Cons:
• Moderate learning curve
• Query syntax
• Security