Functional Programming
and Big Data
Dan Marshall
Big Data Applications Engineer
June 11, 2015
© 2015 Wells Fargo Bank, N.A. All rights reserved. For public use.
Agenda
 Functional vs Other Models
 Functional Thinking and Big Data
 Case Study
Functional Programming
 It’s new and shiny! It’s the latest thing!
 Not so much…
– Mathematical foundations introduced with lambda calculus in the 1930s
– Lisp language created at MIT in 1958
– Google based much of the original Map/Reduce research on well-established Functional Programming concepts
 “Pure” Functional Languages
– Haskell (strictly pure); Common Lisp, F#, Erlang (functional-first, but they allow side effects)
– Pure functional code lends itself to mathematical proofs of correctness (as opposed to “Design Patterns”)
 Hybrids
– Scala, Clojure
 Non-functional (mostly – with increasing FP capabilities)
– C, C++, Java, Python
This presentation focuses on real-life application – not academic theory.
Learning to “think functionally” is the most important first step.
Functional Programming
 First-class functions
– Functions are “first-class citizens”: they can be used anywhere a value can (assigned to variables, stored in data structures)
– A function can be passed as an argument to another function, or returned as a result from one
 Pure functions
– No side effects
– Facilitates parallel and/or concurrent operations and optimizations
 Recursion
– Iteration done through recursion (as opposed to looping)
 Immutability
– Avoid changing state and avoid mutable data
 Declarative (vs Imperative)
– What to do, not how to do it
http://en.wikipedia.org/wiki/Functional_programming
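A minimal Python sketch of the first two properties listed on this slide. The names `apply_twice` and `make_adder` are hypothetical, chosen only for illustration.

```python
def apply_twice(f, x):
    """Takes a function as an argument and applies it two times."""
    return f(f(x))


def make_adder(n):
    """Returns a new function as its result (a closure over n)."""
    def add(x):
        return x + n
    return add


# A function stored in a variable, like any other value.
add_three = make_adder(3)

# A function passed as an argument to another function.
result = apply_twice(add_three, 10)

# Both functions are pure: no state outside their arguments is read or changed.
```

Because neither function touches outside state, calling `apply_twice(add_three, 10)` again gives the same answer every time.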
Simple Scala Example
[Code screenshot not preserved in this transcript. Slide callouts: list of Hadoop Summit attendees; define method; function literal; pass function; pass alternate function.]
Python Example
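The code on the example slides was shown as an image and is not preserved in this transcript. What follows is only a sketch, in Python, of what the callouts on the Scala slide describe: a list of attendees, a method that accepts a function, a function literal, and an alternate function passed to the same method. All names and data here are hypothetical.

```python
# Hypothetical data: the real attendee list from the slide is not preserved.
attendees = ["Ada", "Grace", "Linus", "Barbara"]


def select(names, predicate):
    """Define a method that takes a function and keeps the names it accepts."""
    return [name for name in names if predicate(name)]


# Function literal (a lambda) passed directly as the argument.
short_names = select(attendees, lambda name: len(name) <= 3)


def starts_with_g(name):
    """An alternate, named function with the same shape."""
    return name.startswith("G")


# Pass the alternate function; select() itself does not change.
g_names = select(attendees, starts_with_g)
```

The point of the design is that `select` stays fixed while the behavior is swapped by passing a different function.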
Pure Functions
Y = f(x)
1. The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information…
2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects…
http://en.wikipedia.org/wiki/Pure_function
 Parallelization – independent calls can run in parallel and the results gathered afterward
 Referentially transparent – a call can be replaced by its result, so results can be cached and reused safely
 Lazy evaluation – with no side effects guaranteed, a call whose result is never needed can be skipped
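A small Python sketch of the difference, under the assumption that a module-level counter stands in for "hidden information". The function names are hypothetical. It also shows the first two benefits above: caching with `functools.lru_cache` and running calls in parallel with a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

calls = 0  # hidden state, used only by the impure example


def impure_scale(x):
    """Impure: reads and mutates state outside its arguments."""
    global calls
    calls += 1
    return x * calls


@lru_cache(maxsize=None)
def square(x):
    """Pure: the result depends only on x, so caching it is safe."""
    return x * x


first = impure_scale(10)
second = impure_scale(10)   # same argument, different result

cached_a = square(12)
cached_b = square(12)       # answered from the cache, not recomputed

# Work in parallel, gather results: safe because square shares no state.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, range(5)))
```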
Recursion
• Important concept in computer science – many nuances – big topic
• For this discussion, think recursion vs iteration
• Recursion in FP allows for:
– “Iteration” without mutating state
– “Tail call” optimization (where the language supports it) to avoid blowing out the stack
– Simpler code
Bad:
  for (int x = 0; x < len(myList); x++) {
    myList2(x) = doSomething with myList…
  }
Good:
  for (var x in myList):
    produce myList2
Better:
  myList2 = map(doSomething, myList)
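The three variants on this slide are pseudocode. A runnable Python rendering might look like the sketch below, where `do_something` is a hypothetical stand-in for the slide's `doSomething`, and `map_recursive` shows iteration by recursion with nothing mutated.

```python
def do_something(x):
    """Hypothetical stand-in for the slide's doSomething."""
    return x * 2


my_list = (1, 2, 3)

# "Bad": index-driven loop that mutates a pre-allocated result list.
bad = [None] * len(my_list)
for i in range(len(my_list)):
    bad[i] = do_something(my_list[i])

# "Good": iterate over the elements and produce a new list.
good = [do_something(x) for x in my_list]

# "Better": hand the function to map and let it drive the iteration.
better = list(map(do_something, my_list))


def map_recursive(f, items):
    """Recursion in place of a loop: nothing is mutated along the way."""
    if not items:
        return ()
    return (f(items[0]),) + map_recursive(f, items[1:])


recursive = map_recursive(do_something, my_list)
```

Note that CPython does not eliminate tail calls, so deep recursion there hits the recursion limit. Scala, by contrast, can optimize self-recursive tail calls, and its `@tailrec` annotation asks the compiler to verify that it did.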
Immutability
• Once an object/structure is created it cannot be changed
• Immutable data can be shared across threads without locks, so a whole class of thread-safety problems goes away
• Mutations produce side effects (some you don’t know about)
• Code is more maintainable – a value, once created, never changes underneath you
  for (int x = 0; x < len(myList); x++) {
    myList2(x) = doSomething with myList
  }
Oops! – “myList” is altered by another thread during the loop
Thought for the day:
HDFS is broken – I can’t change a file! Hurry up and fix it!
(Are you sure?)
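A minimal Python sketch of the idea, assuming a hypothetical `Account` record. A frozen dataclass rejects in-place changes, and a "change" is expressed by building a new object.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Account:
    """Hypothetical record type; frozen=True blocks attribute assignment."""
    cust_id: int
    status: str


original = Account(cust_id=101, status="OPEN")

try:
    original.status = "CLOSED"   # attempted in-place mutation
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

# A "change" produces a new object; the original is untouched.
closed = replace(original, status="CLOSED")

# A tuple gives the same guarantee for a sequence: no other thread can
# alter my_list while it is being read.
my_list = (1, 2, 3)
my_list2 = tuple(x * 2 for x in my_list)
```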
Data-Centric Applications
• Functional Programming is important in data-centric applications
• Big Data tends to be highly data-centric
Hadoop Map Reduce
[Diagram: a “Function” (jar file) is shipped to the data in HDFS and run by parallel Mappers; Mapper output passes through Shuffle/Sort to the Reducers.]
Map phase: list2 = map(function, list1)
Reduce phase: list3 = list2.reduceLeft(_ max _)
FP & Big Data
 Immutability
 Recursion
 Functions
 Parallelism
 No side effects
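The two expressions on this slide, `map(function, list1)` and `reduceLeft(_ max _)`, can be tried on a single machine. The sketch below is a Python analogue only, not Hadoop code; the body of `function` and the sample data are hypothetical.

```python
from functools import reduce


def function(record):
    """Hypothetical map function: the length of each input record."""
    return len(record)


list1 = ["hadoop", "pig", "spark", "cascading"]

# Map phase: the same function is applied to every element independently,
# which is what allows the mappers to run in parallel.
list2 = list(map(function, list1))

# Reduce phase: fold the mapped values into one result, keeping the larger
# value at each step (a Python analogue of reduceLeft(_ max _)).
list3 = reduce(max, list2)
```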
<OPINION>
          | Gen 1                 | Gen 2              | Gen 3
Transform | Map/Reduce (Java)     | Spark (Scala)      | Sparkling (Clojure)
Data Flow | Pig, Cascading (Java) | Scalding (Scala)   | Cascalog (Clojure)
CEP       | Storm (Java)          | ScalaStorm (Scala) | Storm (Clojure DSL)
Messaging | Kafka (Java)          | Kafka (Scala)      | Kafka (Clojure)
</OPINION>
Gravitate to same language
Increasingly Functional Over Time
Big Data Framework “disappears” into Language
Case Study
 Hadoop job to scan transactions within date range specified as argument to script
 Output a record for each transaction within the range which has:
– “closed_date” within the range specified – add “type” of “CLOSED”
– “open_date” within the range specified – add “type” of “OPENED”
– if one record has both dates within the range specified, output 2 records – 1 with each “type”
Input:
Cust ID | Open Date  | Close Date
101     | 2014/02/01 | null
202     | 2014/01/12 | 2014/02/22
303     | 2014/02/02 | 2014/02/28
404     | 2014/01/04 | 2014/03/14
Output:
Cust ID | Open Date  | Close Date | Type
101     | 2014/02/01 | null       | OPENED
202     | 2014/01/12 | 2014/02/22 | CLOSED
303     | 2014/02/02 | 2014/02/28 | OPENED
303     | 2014/02/02 | 2014/02/28 | CLOSED
Range specified: 2014/02/01 through 2014/02/28
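The rules above can be written as one pure function over a single record. This is a Python sketch of the logic only, not the Pig script or Java UDF from the talk. The name `classify` is hypothetical, and the range is assumed to be inclusive at both ends, which is consistent with the sample output.

```python
from datetime import date


def classify(record, start, end):
    """Return zero, one, or two output records for one input record.

    Pure: the result depends only on the record and the range passed in.
    """
    cust_id, open_date, close_date = record
    out = []
    if open_date is not None and start <= open_date <= end:
        out.append((cust_id, open_date, close_date, "OPENED"))
    if close_date is not None and start <= close_date <= end:
        out.append((cust_id, open_date, close_date, "CLOSED"))
    return out


# Sample rows from the slide; None stands for a null close date.
transactions = [
    (101, date(2014, 2, 1), None),
    (202, date(2014, 1, 12), date(2014, 2, 22)),
    (303, date(2014, 2, 2), date(2014, 2, 28)),
    (404, date(2014, 1, 4), date(2014, 3, 14)),
]

start, end = date(2014, 2, 1), date(2014, 2, 28)

# flatMap-style: each input row maps to a list, and the lists are joined.
results = [row for rec in transactions for row in classify(rec, start, end)]
```

Customer 404 produces no output because neither of its dates falls inside the range, and customer 303 produces two records, one per type.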
Case Study – Version 1
 Bash script uses date arithmetic to drive multiple FOR loops
– In LOOP #1, executes the “OPENED” Pig script – once per day in range
– In LOOP #2, executes the “CLOSED” Pig script – once per day in range
– So… for this example, a Pig script is called 56 times (28 days × 2 scripts)
 Pig script:
– FILTER input BY (close or open) date for respective script
– Call Java UDF for transformation – assigns “Type” depending on Pig script version
 Outputs merged to same directory outside Pig
 Results were 100% correct from business standpoint
 Processing a range of 1 month took ~16 hours
Case Study – Version 2
 Bash script passes date range to Pig script, calls 1 script 1 time
 Pig script:
– FILTER input BY close or open date within range (either or both match)
– Call Java UDF for transformation – assigns one or two “Type” depending on date values
 Results compared to “Version 1” code – matched exactly… unless I ran late in the day?!
 Side effect!
 Results were 100% correct from business standpoint
 Processing a range of 1 month took ~30 minutes
 About 3% of Version 1 runtime (~30 minutes vs ~16 hours)
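The slides do not say what caused the late-in-the-day mismatch. Purely as an assumption, the Python sketch below shows one common way a time-of-day dependence creeps in: a function that reads the clock instead of taking the date as an argument. Both function names are hypothetical.

```python
from datetime import date, datetime


def in_range_impure(d, days_back):
    """Impure: the answer depends on when the function happens to run."""
    today = datetime.now().date()   # hidden input: the wall clock
    return 0 <= (today - d).days <= days_back


def in_range_pure(d, start, end):
    """Pure: every input is an explicit argument, so reruns always agree."""
    return start <= d <= end


inside = in_range_pure(date(2014, 2, 22), date(2014, 2, 1), date(2014, 2, 28))
outside = in_range_pure(date(2014, 3, 14), date(2014, 2, 1), date(2014, 2, 28))
```

With the pure form, two runs over the same input and the same range can be compared record for record, whatever time they are started.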
Functional Programming
and Big Data
Dan Marshall
Big Data Applications Engineer
June 11, 2015
© 2015 Wells Fargo Bank, N.A. All rights reserved. For public use.

More Related Content

What's hot

Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
DataWorks Summit
 

What's hot (20)

Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 

Viewers also liked

a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resources
DataWorks Summit
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
DataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
DataWorks Summit
 
Apache Lens: Unified OLAP on Realtime and Historic Data
Apache Lens: Unified OLAP on Realtime and Historic DataApache Lens: Unified OLAP on Realtime and Historic Data
Apache Lens: Unified OLAP on Realtime and Historic Data
DataWorks Summit
 
June 10 145pm hortonworks_tan & welch_v2
June 10 145pm hortonworks_tan & welch_v2June 10 145pm hortonworks_tan & welch_v2
June 10 145pm hortonworks_tan & welch_v2
DataWorks Summit
 
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
DataWorks Summit
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
DataWorks Summit
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
DataWorks Summit
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
DataWorks Summit
 

Viewers also liked (20)

Complex Analytics using Open Source Technologies
Complex Analytics using Open Source TechnologiesComplex Analytics using Open Source Technologies
Complex Analytics using Open Source Technologies
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resources
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case Study
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and Time
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Apache Lens: Unified OLAP on Realtime and Historic Data
Apache Lens: Unified OLAP on Realtime and Historic DataApache Lens: Unified OLAP on Realtime and Historic Data
Apache Lens: Unified OLAP on Realtime and Historic Data
 
From Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllFrom Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for All
 
June 10 145pm hortonworks_tan & welch_v2
June 10 145pm hortonworks_tan & welch_v2June 10 145pm hortonworks_tan & welch_v2
June 10 145pm hortonworks_tan & welch_v2
 
large scale collaborative filtering using Apache Giraph
large scale collaborative filtering using Apache Giraphlarge scale collaborative filtering using Apache Giraph
large scale collaborative filtering using Apache Giraph
 
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop Summit
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 

Similar to Functional Programming and Big Data

Similar to Functional Programming and Big Data (20)

Lipstick On Pig
Lipstick On Pig Lipstick On Pig
Lipstick On Pig
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
 
Introduction to functional programming
Introduction to functional programmingIntroduction to functional programming
Introduction to functional programming
 
Functional Programming in JavaScript & ESNext
Functional Programming in JavaScript & ESNextFunctional Programming in JavaScript & ESNext
Functional Programming in JavaScript & ESNext
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
ScalaDays 2013 Keynote Speech by Martin Odersky
ScalaDays 2013 Keynote Speech by Martin OderskyScalaDays 2013 Keynote Speech by Martin Odersky
ScalaDays 2013 Keynote Speech by Martin Odersky
 
Preparing for Scala 3
Preparing for Scala 3Preparing for Scala 3
Preparing for Scala 3
 
Polyglot and Functional Programming (OSCON 2012)
Polyglot and Functional Programming (OSCON 2012)Polyglot and Functional Programming (OSCON 2012)
Polyglot and Functional Programming (OSCON 2012)
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Swift: A parallel scripting for applications at the petascale and beyond.
Swift: A parallel scripting for applications at the petascale and beyond.Swift: A parallel scripting for applications at the petascale and beyond.
Swift: A parallel scripting for applications at the petascale and beyond.
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
F# Ignite - DNAD2010
F# Ignite - DNAD2010F# Ignite - DNAD2010
F# Ignite - DNAD2010
 
The Art of Evolutionary Algorithms Programming
The Art of Evolutionary Algorithms ProgrammingThe Art of Evolutionary Algorithms Programming
The Art of Evolutionary Algorithms Programming
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and Benchmarking
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Functional Programming and Big Data

  • 1. Functional Programming and Big Data Dan Marshall Big Data Applications Engineer June 11, 2015 © 2015 Wells Fargo Bank, N.A. All rights reserved. For public use.
  • 2. Agenda
      • Functional vs Other Models
      • Functional Thinking and Big Data
      • Case Study
  • 3. Functional Programming
      • It’s new and shiny! It’s the latest thing!
      • Not so much…
          – Mathematical foundations introduced with lambda calculus in the 1930s
          – Lisp language created at MIT in 1958
          – Google based much of original Map/Reduce research on well-established Functional Programming concepts
      • “Pure” Functional Languages
          – Common Lisp, Haskell, F#, Erlang
          – Can be mathematically proven to be correct (as opposed to “Design Patterns”)
      • Hybrids
          – Scala, Clojure
      • Non-functional (mostly – with increasing FP capabilities)
          – C, C++, Java, Python
      This presentation focuses on real-life application – not academic theory.
      “Think functionally” is most important to start.
  • 4. Functional Programming
      • First-class functions
          – Functions are “first-class citizens” and can be used anywhere in the code
          – A function can be an argument to another function – or returned as a result from a function
      • Pure functions
          – No side effects
          – Facilitates parallel and/or concurrent operations and optimizations
      • Recursion
          – Iteration done through recursion (as opposed to looping)
      • Immutability
          – Avoid changing state and avoid mutable data
      • Declarative (vs Imperative)
          – What to do, not how to do it
      http://en.wikipedia.org/wiki/Functional_programming
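The first-class function idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code from the deck: the names `greet_all`, `make_greeter`, and the attendee list are hypothetical.

```python
from typing import Callable, List


def greet_all(names: List[str], greet: Callable[[str], str]) -> List[str]:
    """Take a function as an argument and apply it to every name."""
    return [greet(name) for name in names]


def make_greeter(greeting: str) -> Callable[[str], str]:
    """Return a function as a result; it closes over `greeting`."""
    return lambda name: f"{greeting}, {name}!"


attendees = ["Ada", "Grace", "Alan"]  # hypothetical sample data

hello = make_greeter("Hello")          # a function stored in a variable
print(greet_all(attendees, hello))     # pass one function
print(greet_all(attendees, str.upper)) # pass an alternate function
```

The caller decides what happens to each element; `greet_all` only says that something is applied to all of them.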
  • 6. Simple Scala Example
      [Code screenshot with callouts: list of Hadoop Summit attendees; define method; function literal; pass function; pass alternate function]
  • 8. Pure Functions
      Y = f(x)
      1. The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information…
      2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects…
      http://en.wikipedia.org/wiki/Pure_function
      • Parallelization – work in parallel, gather results
      • Referentially transparent – results can be cached and/or reused safely
      • Lazy evaluation – because there are guaranteed to be no side effects, a function call whose result is not needed may be skipped
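A small Python sketch of the two conditions, and of why referential transparency makes caching safe. The function names and the tax calculation are hypothetical, chosen only for illustration.

```python
from functools import lru_cache

call_count = 0  # hidden state that the impure function depends on


def impure_add_fee(amount_cents: int) -> int:
    """Impure: mutates a global and returns a value that depends on it."""
    global call_count
    call_count += 1
    return amount_cents + call_count


@lru_cache(maxsize=None)
def pure_add_tax(amount_cents: int, rate_percent: int) -> int:
    """Pure: the result depends only on the arguments, so it can be cached."""
    return amount_cents + amount_cents * rate_percent // 100


first = impure_add_fee(1000)
second = impure_add_fee(1000)
print(first == second)          # same argument, different results

print(pure_add_tax(1000, 8))    # computed once
print(pure_add_tax(1000, 8))    # served from the cache, same value
```

Caching `impure_add_fee` the same way would silently change the program's behaviour, which is the practical meaning of "results can be reused safely" for pure functions only.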
  • 9. Recursion
      • Important concept in computer science – many nuances – big topic
      • For this discussion, think recursion vs iteration
      • Recursion in FP allows for:
          • “Iteration” without mutating
          • “Tail call” optimization to avoid blowing out the stack
          • Simpler code
      Bad:    for (int x = 0 ; x < len(myList); x++) { myList2(x) = doSomething with myList… }
      Good:   for (var x in myList): produce myList2
      Better: myList2 = map(doSomething, myList)
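The Bad / Good / Better progression is pseudocode on the slide. A runnable Python rendering might look like the following, where `double` stands in for the slide's `doSomething` and is purely an assumption. Note that CPython does not perform tail-call optimization, so the recursive version is shown for the idea rather than for use on long lists.

```python
from typing import List, Tuple


def double(n: int) -> int:
    """Hypothetical stand-in for the slide's doSomething."""
    return n * 2


def with_index_loop(items: List[int]) -> List[int]:
    """'Bad': manual index bookkeeping and in-place mutation of the result."""
    result = [0] * len(items)
    for i in range(len(items)):
        result[i] = double(items[i])
    return result


def with_recursion(items: Tuple[int, ...]) -> Tuple[int, ...]:
    """'Iteration' without mutation: handle the head, recurse on the tail."""
    if not items:
        return ()
    return (double(items[0]),) + with_recursion(items[1:])


def with_map(items: List[int]) -> List[int]:
    """'Better': say what to do (apply double), not how to loop."""
    return list(map(double, items))


my_list = [1, 2, 3]
print(with_index_loop(my_list))
print(with_recursion(tuple(my_list)))
print(with_map(my_list))
```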
  • 10. Immutability
      • Once an object/structure is created it cannot be changed
      • Thread safety is no longer an issue
      • Mutations produce side effects – (some you don’t know about)
      • Code is more maintainable – it just does what it does
      for (int x = 0 ; x < len(myList); x++) { myList2(x) = doSomething with myList }
      Oops! – “myList” is altered by another thread during the loop
      Thought for the day: HDFS is broken – I can’t change a file! Hurry up and fix it! (Are you sure?)
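One way to get this behaviour in Python is a frozen dataclass: "changing" a value produces a new object and leaves the original untouched. The `Account` type and its fields are hypothetical and are not taken from the deck.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Account:
    """Hypothetical immutable record; fields cannot be reassigned."""
    cust_id: int
    status: str


original = Account(cust_id=101, status="OPEN")

# Derive a new value instead of mutating the existing one.
closed = replace(original, status="CLOSED")

try:
    original.status = "CLOSED"  # attempted in-place mutation
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

print(original, closed, mutation_blocked)
```

Because `original` can never change, any thread holding a reference to it sees the same value for as long as it holds it.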
  • 11. Data-Centric Applications
      • Functional Programming is important in data-centric applications
      • Big Data tends to be highly data-centric
  • 12. Hadoop Map Reduce
      [Diagram: “Function” (jar file); HDFS; Mappers; Shuffle/Sort; Reducers]
      Ship function to data
      list2 = map(function, list1)
      list3 = list2.reduceLeft(_ max _)
      FP & Big Data
      • Immutability
      • Recursion
      • Functions
      • Parallelism
      • No side effects
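The two expressions on this slide, `map(function, list1)` and `reduceLeft(_ max _)`, can be imitated on a single machine in Python. This is only a sketch of the shape of the computation, not of Hadoop itself: the partitions, the `mapper` function, and the data are assumptions.

```python
from functools import reduce
from typing import List

# Hypothetical data, already split into blocks as a file system might split it.
partitions: List[List[int]] = [[3, 9, 4], [12, 1], [7, 8]]


def mapper(x: int) -> int:
    """Hypothetical pure function that is 'shipped' to each partition."""
    return x * x


# Map phase: each partition is transformed independently of the others.
mapped = [list(map(mapper, part)) for part in partitions]

# Reduce phase: fold each partition with max, then fold the partial results.
partial_maxima = [reduce(max, part) for part in mapped]
overall_max = reduce(max, partial_maxima)

print(overall_max)
```

Because `mapper` has no side effects and `max` is associative, the partitions can be processed in any order, or in parallel, without changing the result.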
  • 13. <OPINION>
      Columns: Gen 1, Gen 2, Gen 3
      Transform: Map/Reduce (Java), Spark (Scala), Sparkling (Clojure)
      Data Flow: Pig, Cascading (Java), Scalding (Scala), Cascalog (Clojure)
      CEP: Storm (Java), ScalaStorm (Scala), Storm (Clojure DSL)
      Messaging: Kafka (Java), Kafka (Scala), Kafka (Clojure)
      </OPINION>
      • Gravitate to same language
      • Increasingly Functional Over Time
      • Big Data Framework “disappears” into Language
  • 14. Case Study
      • Hadoop job to scan transactions within date range specified as argument to script
      • Output a record for each transaction within the range which has:
          – “closed_date” within the range specified – add “type” of “CLOSED”
          – “open_date” within the range specified – add “type” of “OPENED”
          – if one record has both dates within the range specified, output 2 records – 1 with each “type”

      Range specified: 2014/02/01 through 2014/02/28

      Input:
      Cust ID   Open Date    Close Date
      101       2014/02/01   null
      202       2014/01/12   2014/02/22
      303       2014/02/02   2014/02/28
      404       2014/01/04   2014/03/14

      Output:
      Cust ID   Open Date    Close Date   Type
      101       2014/02/01   null         OPENED
      202       2014/01/12   2014/02/22   CLOSED
      303       2014/02/02   2014/02/28   OPENED
      303       2014/02/02   2014/02/28   CLOSED
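The rule on this slide can be written as a single pure function that maps one input record to zero, one, or two output records. The sketch below is not the production Pig/Java code; the type and function names are hypothetical, and the range is treated as inclusive at both ends because the sample output includes records dated on the first and last day.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Txn(NamedTuple):
    cust_id: int
    open_date: date
    close_date: Optional[date]  # None models the slide's "null"


class TaggedTxn(NamedTuple):
    cust_id: int
    open_date: date
    close_date: Optional[date]
    txn_type: str


def in_range(d: Optional[date], start: date, end: date) -> bool:
    """True when d is present and start <= d <= end (inclusive)."""
    return d is not None and start <= d <= end


def tag(txn: Txn, start: date, end: date) -> List[TaggedTxn]:
    """Emit one record per matching date: OPENED, CLOSED, both, or none."""
    out = []
    if in_range(txn.open_date, start, end):
        out.append(TaggedTxn(*txn, "OPENED"))
    if in_range(txn.close_date, start, end):
        out.append(TaggedTxn(*txn, "CLOSED"))
    return out


transactions = [
    Txn(101, date(2014, 2, 1), None),
    Txn(202, date(2014, 1, 12), date(2014, 2, 22)),
    Txn(303, date(2014, 2, 2), date(2014, 2, 28)),
    Txn(404, date(2014, 1, 4), date(2014, 3, 14)),
]
start, end = date(2014, 2, 1), date(2014, 2, 28)

# Flat-map: every input record contributes its own list of outputs.
results = [rec for txn in transactions for rec in tag(txn, start, end)]
for rec in results:
    print(rec.cust_id, rec.txn_type)
```

Run against the slide's four input rows, this yields the four output rows shown in the table: 101 OPENED, 202 CLOSED, 303 OPENED, 303 CLOSED.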
  • 15. Case Study – Version 1
      • Bash script date arithmetic to create multiple FOR LOOPs
          – In LOOP # 1, executes “OPENED” Pig script – once per day in range
          – In LOOP # 2, executes “CLOSED” Pig script – once per day in range
          – So…for this example (28 days × 2 scripts), a Pig script is called 56 times
      • Pig script:
          – FILTER input BY (close or open) date for respective script
          – Call Java UDF for transformation – assigns “Type” depending on Pig script version
      • Outputs merged to same directory outside Pig
      • Results were 100% correct from business standpoint
      • Processing a range of 1 month took ~16 hours
  • 16. Case Study – Version 2
      • Bash script passes date range to Pig script, calls 1 script 1 time
      • Pig script:
          – FILTER input BY close or open date within range (either or both match)
          – Call Java UDF for transformation – assigns one or two “Type” depending on date values
      • Results compared to “Version 1” code – matched exactly… unless I ran late in the day?!
      • Side effect!
      • Results were 100% correct from business standpoint
      • Processing a range of 1 month took ~30 minutes
      • 3% of Version 1 runtime
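The slide does not say what the side effect was. One way a "matches unless run late in the day" discrepancy can arise, assumed here purely for illustration, is code that reads the system clock internally instead of receiving every date as an argument. The Python sketch below contrasts the two styles; both function names are hypothetical.

```python
from datetime import date


def in_range_hidden_clock(d: date, start: date) -> bool:
    """Impure: the upper bound is read from the clock, a hidden input.

    The same arguments can give different answers depending on when
    (and in which time zone) the job happens to run.
    """
    return start <= d <= date.today()


def in_range_explicit(d: date, start: date, end: date) -> bool:
    """Pure: every input is an argument, so reruns always agree."""
    return start <= d <= end


start, end = date(2014, 2, 1), date(2014, 2, 28)
print(in_range_explicit(date(2014, 2, 10), start, end))
print(in_range_explicit(date(2014, 3, 1), start, end))
```

Passing the range in explicitly is also what makes a comparison between two versions of a job meaningful: both can be run on the same inputs at any time and be expected to agree.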