
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017

This talk will go over how to build an end-to-end data processing system in Python, from data ingest, to data analytics, to machine learning, to user presentation. Developments in both old and new tools have made this especially practical today. In particular, the talk will cover Airflow for process workflows, PySpark for data processing, Python data science libraries for machine learning and advanced analytics, and building agile microservices in Python.

System architects, software engineers, data scientists, and business leaders can all benefit from attending the talk. They should learn how to build more agile data processing systems and take away some ideas on how their data systems could be simpler and more powerful.

  1. End-to-end Big Data Projects with Python. Donald Miner. StampedeCon Big Data, July 25th, 2017
  2. Donald Miner, dminer@minerkasch.com
  3. 2009: BIG DATA =
  4. Real “Big Data” isn’t just about platforms anymore
  5. Real “Big Data” isn’t just about platforms anymore: Streaming, Infrastructure, Cloud, Applications, Mobile
  6. Real “Big Data” isn’t just about data processing anymore
  7. Real “Big Data” isn’t just about data processing anymore: Machine Learning, Data Science, NLP, Deep Learning, Visualization
  8. 2017: BIG DATA = & integrated and user-facing & advanced analytics
  9. Big Data in 2009 was very Java-oriented: you either used Java for everything or cobbled together a collection of random languages.
  10. Python seemed to have everything we wanted, except for Big Data. Some brave souls tried: Hadoopy, mrjob, Pig+Python.
  11. PySpark: the missing piece of the Big Data Python picture. The first major Big Data platform with first-class Python support. Thanks to PySpark, Python is now a viable and competitive option for end-to-end systems that utilize Big Data.
  12. What’s the big deal? Python has best-in-class functionality for all the other things we want to do with Big Data: data manipulation, machine learning, text, applications, visualization. In 2017, we can build end-to-end Big Data systems entirely in Python: from ingest to user experience and everything in between.
  13. The case for Python: succinct code that’s easy to read
  14. The case for Python: a language people know
  15. The case for Python: interpreted, not compiled
  16. The Python Big Data Architecture
  17. Distributed Computing

      from operator import add
      from pyspark.sql import SparkSession

      # The snippet assumes a live SparkSession named `spark`
      spark = SparkSession.builder.appName('wordcount').getOrCreate()

      # Read data as lines from a source
      lines = spark.read.text(inpath).rdd.map(lambda r: r[0])

      # Count the data
      counts = (lines.flatMap(lambda x: x.split(' '))
                     .map(lambda x: (x, 1))
                     .reduceByKey(add))

      # Bring it locally
      output = counts.collect()
  18. Machine Learning

      import sklearn.ensemble

      # Initialize a Random Forest classifier with 250 trees
      clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)

      # Train the Random Forest classifier
      clf = clf.fit(feature_vectors, labels)

      Why sklearn over MLlib?
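      Once trained, the classifier can score new data locally. A minimal sketch of checking how well it generalizes (the train/test split and accuracy check are additions, not from the slides; feature_vectors and labels are as above):

      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score
      import sklearn.ensemble

      # Hold out 25% of the data to estimate generalization
      X_train, X_test, y_train, y_test = train_test_split(
          feature_vectors, labels, test_size=0.25)

      clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)
      clf.fit(X_train, y_train)

      # Score the held-out data
      print(accuracy_score(y_test, clf.predict(X_test)))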
  19. Deep Learning (Keras)

      from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions

      # Load the ImageNet pretrained network
      model = VGG16(weights="imagenet")

      # Run the model on an image (here `image` is a (1, 224, 224, 3) array)
      preds = model.predict(preprocess_input(image))

      # Hotdog or not hotdog?
      print(decode_predictions(preds))

      Also excited about PyTorch!
  20. Visualization

      Lots of visualization options in Python:
      • Seaborn
      • Matplotlib
      • Bokeh
      • ggplot

      seaborn.swarmplot(x="measurement", y="value", hue="species", data=iris)
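      The swarmplot call assumes the iris data is in long form. A minimal sketch of the full setup (the load_dataset and melt steps are assumptions, not on the slide):

      import pandas as pd
      import seaborn
      import matplotlib.pyplot as plt

      # Load the bundled iris dataset and melt it into long form:
      # one (species, measurement, value) row per observation
      iris = seaborn.load_dataset("iris")
      iris = pd.melt(iris, "species", var_name="measurement")

      seaborn.swarmplot(x="measurement", y="value", hue="species", data=iris)
      plt.show()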
  21. Integration

      import json
      from flask import Flask, request

      app = Flask(__name__)

      # http://56.120.177.55/hello?name=Don
      @app.route('/hello')
      def say_hello():
          name = request.args.get('name')
          return json.dumps({'query': name, 'message': 'HELLO ' + name})

      # returns {'query': 'Don', 'message': 'HELLO Don'}
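      Calling the microservice is one line from any Python client. A hedged usage sketch (the requests call is an addition; the URL is the one on the slide):

      import requests

      resp = requests.get('http://56.120.177.55/hello', params={'name': 'Don'})
      print(resp.json())  # {'query': 'Don', 'message': 'HELLO Don'}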
  22. Workflows

      from airflow import DAG
      from airflow.operators.python_operator import PythonOperator

      # Run this every day at 3:45 AM
      mdag = DAG('DRSpark', description='DailyRun', schedule_interval='45 3 * * *')

      sp1 = PythonOperator(task_id='sp1', python_callable=runspark1, dag=mdag)
      sp2 = PythonOperator(task_id='sp2', python_callable=runspark2, dag=mdag)
      ou = PythonOperator(task_id='clean', python_callable=cleanupresults, dag=mdag)

      sp1 >> ou  # sp1 happens before ou
      sp2 >> ou  # sp2 happens before ou, but doesn't depend on sp1
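      The callables themselves aren't shown on the slide. A hypothetical sketch of what one (e.g. runspark1) might do, assuming the job is submitted via spark-submit (the script path is made up):

      import subprocess

      def runspark1():
          # Hypothetical: submit the first daily Spark job to the cluster
          subprocess.check_call(['spark-submit', '/jobs/daily_job_1.py'])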
  23. import pickle
      import sklearn.ensemble

      # Spark job to build feature vectors
      rows = myrdd.map(lambda r: r[0].split(','))
      out = (rows.map(lambda row: (row[0], row))
                 .groupByKey().map(build_feature_vector))  # outputs [(FV, label)]

      # Bring data down locally and prepare it
      localout = out.collect()
      X = [row[0] for row in localout]  # feature is a set of 40 aggregate properties
      y = [row[1] for row in localout]  # potential labels are types of devices

      # Train a RF classifier on it
      clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)
      clf = clf.fit(X, y)

      # Save the model (maybe to S3 instead?)
      pickle.dump(clf, open('/models/behavior.sklearn', 'wb'))

      --- training data sample of netflow ---
      SOURCE IP      DEST IP         DATE         STIME  ETIME  DATAIN  DATAOUT
      123.41.12.31,  123.41.155.32,  2017-02-01,  09:00, 09:59, 103KB,  959KB
      123.41.59.99,  123.41.155.32,  2017-02-01,  09:00, 09:59, 44KB,   884KB
      123.41.12.31,  123.41.155.32,  2017-02-01,  10:00, 10:59, 3KB,    9KB
      123.41.59.99,  123.41.155.32,  2017-02-01,  10:00, 10:59, 4KB,    15KB
  24. # http://56.120.177.55/predictip?ip=159.31.120.44
      # http://56.120.177.55/predictfv -- for POST of feature vector
      _MODEL = pickle.load(open('/models/behavior.sklearn', 'rb'))

      @app.route('/predictfv', methods=['POST'])
      def predicttypefrombehavior():
          netflowlog = request.form['logcsv']
          fv = build_feature_vector(netflowlog)
          pr = _MODEL.predict([fv])[0]
          return json.dumps({'query': fv, 'prediction': pr})

      # returns {'query': [9, 4, 123.1, …], 'prediction': 'HTTP PROXY'}
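      A hedged usage sketch of calling the endpoint (the requests call is an addition; the log line is one row of the netflow sample above):

      import requests

      log = '123.41.12.31, 123.41.155.32, 2017-02-01, 09:00, 09:59, 103KB, 959KB'
      resp = requests.post('http://56.120.177.55/predictfv', data={'logcsv': log})
      print(resp.json())  # e.g. {'query': [...], 'prediction': 'HTTP PROXY'}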
  25. PYTHON!
      • Viable option for Big Data analytics with PySpark
      • Tie it all together and integrate into the enterprise with the same language
      • Leverage the benefits of Python for data analysis
      • Get projects done faster
