SlideShare a Scribd company logo
Distributed DataFrame (DDF)
Simplifying Big Data
For The Rest Of Us
Christopher Nguyen, PhD
Co-Founder & CEO
DATA INTELLIGENCE FOR ALL
1. Motivation: Problem & Solution
2. DDF Features & Benefits
3. Demo
Agenda
• Former Engineering Director of Google Apps 

(Google Founders’ Award)
• Former Professor and Co-Founder of the Computer
Engineering program at HKUST
• PhD Stanford, BS U.C. Berkeley Summa cum Laude
Christopher Nguyen, PhD
Adatao Inc.
Co-Founder & CEO
Small Data
World
RHadoop
HDFS
MapReduce
Hive
HBASE
Pig
SummingBird
Crunch
Storm
Scalding
Cascading
class RandomSplitRDD( prev: RDD[Array[Double]], seed: Long, lower: Double,
upper: Double, isTraining: Boolean)	
extends RDD[Array[Double]](prev) {	
override def getPartitions: Array[Partition] = {	
val rg = new Random(seed)	
firstParent[Array[Double]].partitions.map(x => new SeededPartition(x,
rg.nextInt))	
}	
override def getPreferredLocations(split: Partition): Seq[String] =	
firstParent[Array[Double]].preferredLocations(split.asInstanceOf[SeededPartitio
n].prev)	
override def compute(splitIn: Partition, context: TaskContext):
Iterator[Array[Double]] = {	
val split = splitIn.asInstanceOf[SeededPartition]	
val rand = new Random(split.seed)	
if (isTraining) {	
firstParent[Array[Double]].iterator(split.prev, context).filter(x => {	
val z = rand.nextDouble;	
z < lower || z >= upper	
})	
} else {	
firstParent[Array[Double]].iterator(split.prev, context).filter(x => {	
val z = rand.nextDouble;	
lower <= z && z < upper	
})	
}	
}	
}	
def randomSplit[Array[Double]](rdd: RDD[Array[Double]], numSplits: Int,
trainingSize: Double, seed: Long): Iterator[(RDD[Array[Double]],
RDD[Array[Double]])] = {	
require(0 < trainingSize && trainingSize < 1)	
val rg = new Random(seed)	
(1 to numSplits).map(_ => rg.nextInt).map(z =>	
(new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, true),	
new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, false))).toIterator	
}	
Can’t It Be
Even
Simpler?
WHY
class RandomSplitRDD( prev: RDD[Array[Double]], seed: Long, lower: Double,
upper: Double, isTraining: Boolean)	
extends RDD[Array[Double]](prev) {	
override def getPartitions: Array[Partition] = {	
val rg = new Random(seed)	
firstParent[Array[Double]].partitions.map(x => new SeededPartition(x,
rg.nextInt))	
}	
override def getPreferredLocations(split: Partition): Seq[String] =	
firstParent[Array[Double]].preferredLocations(split.asInstanceOf[SeededPartitio
n].prev)	
override def compute(splitIn: Partition, context: TaskContext):
Iterator[Array[Double]] = {	
val split = splitIn.asInstanceOf[SeededPartition]	
val rand = new Random(split.seed)	
if (isTraining) {	
firstParent[Array[Double]].iterator(split.prev, context).filter(x => {	
val z = rand.nextDouble;	
z < lower || z >= upper	
})	
} else {	
firstParent[Array[Double]].iterator(split.prev, context).filter(x => {	
val z = rand.nextDouble;	
lower <= z && z < upper	
})	
}	
}	
}	
def randomSplit[Array[Double]](rdd: RDD[Array[Double]], numSplits: Int,
trainingSize: Double, seed: Long): Iterator[(RDD[Array[Double]],
RDD[Array[Double]])] = {	
require(0 < trainingSize && trainingSize < 1)	
val rg = new Random(seed)	
(1 to numSplits).map(_ => rg.nextInt).map(z =>	
(new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, true),	
new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, false))).toIterator	
}	
table = load(“housePrices”)
table.dropNA
table.train.glm(“price”, “sf,zip,beds,baths”)
What
Happiness
Looks Like
vs.
!
!
HDFS
DAG Engine
MapReduce
Monad
FP
Bottom-Up
Top-Down
we begin to dream / find solution to fight this evil
Design Principles
Sophistication
Power & SpeedEase & Simplicity
HDFS
Web Browser R-Studio Python
Business Analyst Data Scientist Data Engineer
API
API
API
DDFClient
PAClient
PIClient
Example:
CLIENT WORKER WORKER WORKERWORKERMASTER
Demo Deployment
Diagram
Airline Dataset, 12GB, 123M rows
1
2
3
4
5
6
DDF Offers
Table-Like Abstraction
on Top of Big Data
1
1
2
3
4
5
6
DDF Offers
2 Native R Data.frame
Experience
ddf.getSummary()	
ddf.getFiveNum()	
ddf.binning(“distance”, 10)	
ddf.scale(SCALE.MINMAX)
1
2
3
4
5
6
DDF Offers
Real-Time
Scoring
manager.loadModel(uri)	
lm.predict(point)
Data
Wrangling
ddf.dropNA()	
ddf.transform(“speed
=distance/duration”)
ddf.lm(0.1, 10)	
ddf.roc(testddf)
Model
Building & Validation
3 Focus on Analytics,
Not MapReduce
1
2
3
4
5
6
DDF Offers
4 Seamlessly Integrate
with External ML Libraries
Data Algorithm Intelligence
+ =
1
2
3
4
5
6
DDF Offers
5 Share DDFs
Seamlessly & Efficiently
Can I see
your data?
ddf://adatao/airline
1
2
3
4
5
6
DDF Offers
6Work in Your
Preferred Language
…even Plain English!
1
2
3
4
5
6
Table-Like Abstraction
on Top of Big Data
Seamlessly Integrate with
External ML Libraries
Share DDFs
Seamlessly & efficiently
Native R Data.frame
Experience
Focus on Analytics, 

not MR
Work in Your
Preferred Language
as
DDF is to
Big Data
SQL+R is to
Small Data
www.adatao.com/ddf
DATA INTELLIGENCE FOR ALL
We’re Open Sourcing DDF!!!
1. Accept Core Committers
2. Clean up & Prep
3. Open up for public access

More Related Content

What's hot

Scaling modern JVM applications with Akka toolkit
Scaling modern JVM applications with Akka toolkitScaling modern JVM applications with Akka toolkit
Scaling modern JVM applications with Akka toolkitBojan Babic
 
MongoDB: Easy Java Persistence with Morphia
MongoDB: Easy Java Persistence with MorphiaMongoDB: Easy Java Persistence with Morphia
MongoDB: Easy Java Persistence with MorphiaScott Hernandez
 
テスト用のプレゼンテーション
テスト用のプレゼンテーションテスト用のプレゼンテーション
テスト用のプレゼンテーションgooseboi
 
10b. Graph Databases Lab
10b. Graph Databases Lab10b. Graph Databases Lab
10b. Graph Databases LabFabio Fumarola
 
MongoDB: Replication,Sharding,MapReduce
MongoDB: Replication,Sharding,MapReduceMongoDB: Replication,Sharding,MapReduce
MongoDB: Replication,Sharding,MapReduceTakahiro Inoue
 
20110514 mongo dbチューニング
20110514 mongo dbチューニング20110514 mongo dbチューニング
20110514 mongo dbチューニングYuichi Matsuo
 
Mapping Flatland: Using MongoDB for an MMO Crossword Game (GDC Online 2011)
Mapping Flatland: Using MongoDB for an MMO Crossword Game (GDC Online 2011)Mapping Flatland: Using MongoDB for an MMO Crossword Game (GDC Online 2011)
Mapping Flatland: Using MongoDB for an MMO Crossword Game (GDC Online 2011)Grant Goodale
 
Windows Server 2012 Active Directory Recovery
Windows Server 2012 Active Directory RecoveryWindows Server 2012 Active Directory Recovery
Windows Server 2012 Active Directory RecoverySerhad MAKBULOĞLU, MBA
 
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)MongoSF
 
Introduction To Core Data
Introduction To Core DataIntroduction To Core Data
Introduction To Core Datadanielctull
 
Using Arbor/ RGraph JS libaries for Data Visualisation
Using Arbor/ RGraph JS libaries for Data VisualisationUsing Arbor/ RGraph JS libaries for Data Visualisation
Using Arbor/ RGraph JS libaries for Data VisualisationAlex Hardman
 
MySQL flexible schema and JSON for Internet of Things
MySQL flexible schema and JSON for Internet of ThingsMySQL flexible schema and JSON for Internet of Things
MySQL flexible schema and JSON for Internet of ThingsAlexander Rubin
 
Meetup Analytics with R and Neo4j
Meetup Analytics with R and Neo4jMeetup Analytics with R and Neo4j
Meetup Analytics with R and Neo4jNeo4j
 

What's hot (18)

Talk MongoDB - Amil
Talk MongoDB - AmilTalk MongoDB - Amil
Talk MongoDB - Amil
 
Scaling modern JVM applications with Akka toolkit
Scaling modern JVM applications with Akka toolkitScaling modern JVM applications with Akka toolkit
Scaling modern JVM applications with Akka toolkit
 
MongoDB: Easy Java Persistence with Morphia
MongoDB: Easy Java Persistence with MorphiaMongoDB: Easy Java Persistence with Morphia
MongoDB: Easy Java Persistence with Morphia
 
テスト用のプレゼンテーション
テスト用のプレゼンテーションテスト用のプレゼンテーション
テスト用のプレゼンテーション
 
10b. Graph Databases Lab
10b. Graph Databases Lab10b. Graph Databases Lab
10b. Graph Databases Lab
 
MongoDB: Replication,Sharding,MapReduce
MongoDB: Replication,Sharding,MapReduceMongoDB: Replication,Sharding,MapReduce
MongoDB: Replication,Sharding,MapReduce
 
MongoDB-SESSION03
MongoDB-SESSION03MongoDB-SESSION03
MongoDB-SESSION03
 
20110514 mongo dbチューニング
20110514 mongo dbチューニング20110514 mongo dbチューニング
20110514 mongo dbチューニング
 
Reading Data into R
Reading Data into RReading Data into R
Reading Data into R
 
Mapping Flatland: Using MongoDB for an MMO Crossword Game (GDC Online 2011)
Mapping Flatland: Using MongoDB for an MMO Crossword Game (GDC Online 2011)Mapping Flatland: Using MongoDB for an MMO Crossword Game (GDC Online 2011)
Mapping Flatland: Using MongoDB for an MMO Crossword Game (GDC Online 2011)
 
Windows Server 2012 Active Directory Recovery
Windows Server 2012 Active Directory RecoveryWindows Server 2012 Active Directory Recovery
Windows Server 2012 Active Directory Recovery
 
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
 
Introduction To Core Data
Introduction To Core DataIntroduction To Core Data
Introduction To Core Data
 
Using Arbor/ RGraph JS libaries for Data Visualisation
Using Arbor/ RGraph JS libaries for Data VisualisationUsing Arbor/ RGraph JS libaries for Data Visualisation
Using Arbor/ RGraph JS libaries for Data Visualisation
 
Data aggregation in R
Data aggregation in RData aggregation in R
Data aggregation in R
 
WOTC_Import
WOTC_ImportWOTC_Import
WOTC_Import
 
MySQL flexible schema and JSON for Internet of Things
MySQL flexible schema and JSON for Internet of ThingsMySQL flexible schema and JSON for Internet of Things
MySQL flexible schema and JSON for Internet of Things
 
Meetup Analytics with R and Neo4j
Meetup Analytics with R and Neo4jMeetup Analytics with R and Neo4j
Meetup Analytics with R and Neo4j
 

Similar to Distributed DataFrame (DDF) Simplifying Big Data For The Rest Of Us

Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkZalando Technology
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Mark Smith
 
Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in SparkShiao-An Yuan
 
ggtimeseries-->ggplot2 extensions
ggtimeseries-->ggplot2 extensions ggtimeseries-->ggplot2 extensions
ggtimeseries-->ggplot2 extensions Dr. Volkan OBAN
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
From Java to Scala - advantages and possible risks
From Java to Scala - advantages and possible risksFrom Java to Scala - advantages and possible risks
From Java to Scala - advantages and possible risksSeniorDevOnly
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
Introduction to R
Introduction to RIntroduction to R
Introduction to Ragnonchik
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
A deeper-understanding-of-spark-internals-aaron-davidson
A deeper-understanding-of-spark-internals-aaron-davidsonA deeper-understanding-of-spark-internals-aaron-davidson
A deeper-understanding-of-spark-internals-aaron-davidsonCheng Min Chi
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsCheng Min Chi
 
Using Scala Slick at FortyTwo
Using Scala Slick at FortyTwoUsing Scala Slick at FortyTwo
Using Scala Slick at FortyTwoEishay Smith
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to sparkJavier Arrieta
 
20140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture1120140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture11Computer Science Club
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 

Similar to Distributed DataFrame (DDF) Simplifying Big Data For The Rest Of Us (20)

Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016
 
Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in Spark
 
ggtimeseries-->ggplot2 extensions
ggtimeseries-->ggplot2 extensions ggtimeseries-->ggplot2 extensions
ggtimeseries-->ggplot2 extensions
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
From Java to Scala - advantages and possible risks
From Java to Scala - advantages and possible risksFrom Java to Scala - advantages and possible risks
From Java to Scala - advantages and possible risks
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
A deeper-understanding-of-spark-internals-aaron-davidson
A deeper-understanding-of-spark-internals-aaron-davidsonA deeper-understanding-of-spark-internals-aaron-davidson
A deeper-understanding-of-spark-internals-aaron-davidson
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 
Using Scala Slick at FortyTwo
Using Scala Slick at FortyTwoUsing Scala Slick at FortyTwo
Using Scala Slick at FortyTwo
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
 
20140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture1120140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture11
 
Scala
ScalaScala
Scala
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 

Recently uploaded

De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEJelle | Nordend
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...Alluxio, Inc.
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfOrtus Solutions, Corp
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationWave PLM
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAlluxio, Inc.
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...rajkumar669520
 
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion Clinic
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Krakówbim.edu.pl
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisNeo4j
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAlluxio, Inc.
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?XfilesPro
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessWSO2
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfkalichargn70th171
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAlluxio, Inc.
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILNatan Silnitsky
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
 

Recently uploaded (20)

Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysis
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 

Distributed DataFrame (DDF) Simplifying Big Data For The Rest Of Us

  • 1. Distributed DataFrame (DDF) Simplifying Big Data For The Rest Of Us Christopher Nguyen, PhD Co-Founder & CEO DATA INTELLIGENCE FOR ALL
  • 2. 1. Motivation: Problem & Solution 2. DDF Features & Benefits 3. Demo Agenda
  • 3. • Former Engineering Director of Google Apps 
 (Google Founders’ Award) • Former Professor and Co-Founder of the Computer Engineering program at HKUST • PhD Stanford, BS U.C. Berkeley Summa cum Laude Christopher Nguyen, PhD Adatao Inc. Co-Founder & CEO
  • 6.
  • 7. class RandomSplitRDD( prev: RDD[Array[Double]], seed: Long, lower: Double, upper: Double, isTraining: Boolean) extends RDD[Array[Double]](prev) { override def getPartitions: Array[Partition] = { val rg = new Random(seed) firstParent[Array[Double]].partitions.map(x => new SeededPartition(x, rg.nextInt)) } override def getPreferredLocations(split: Partition): Seq[String] = firstParent[Array[Double]].preferredLocations(split.asInstanceOf[SeededPartitio n].prev) override def compute(splitIn: Partition, context: TaskContext): Iterator[Array[Double]] = { val split = splitIn.asInstanceOf[SeededPartition] val rand = new Random(split.seed) if (isTraining) { firstParent[Array[Double]].iterator(split.prev, context).filter(x => { val z = rand.nextDouble; z < lower || z >= upper }) } else { firstParent[Array[Double]].iterator(split.prev, context).filter(x => { val z = rand.nextDouble; lower <= z && z < upper }) } } } def randomSplit[Array[Double]](rdd: RDD[Array[Double]], numSplits: Int, trainingSize: Double, seed: Long): Iterator[(RDD[Array[Double]], RDD[Array[Double]])] = { require(0 < trainingSize && trainingSize < 1) val rg = new Random(seed) (1 to numSplits).map(_ => rg.nextInt).map(z => (new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, true), new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, false))).toIterator } Can’t It Be Even Simpler? WHY
  • 8. class RandomSplitRDD( prev: RDD[Array[Double]], seed: Long, lower: Double, upper: Double, isTraining: Boolean) extends RDD[Array[Double]](prev) { override def getPartitions: Array[Partition] = { val rg = new Random(seed) firstParent[Array[Double]].partitions.map(x => new SeededPartition(x, rg.nextInt)) } override def getPreferredLocations(split: Partition): Seq[String] = firstParent[Array[Double]].preferredLocations(split.asInstanceOf[SeededPartitio n].prev) override def compute(splitIn: Partition, context: TaskContext): Iterator[Array[Double]] = { val split = splitIn.asInstanceOf[SeededPartition] val rand = new Random(split.seed) if (isTraining) { firstParent[Array[Double]].iterator(split.prev, context).filter(x => { val z = rand.nextDouble; z < lower || z >= upper }) } else { firstParent[Array[Double]].iterator(split.prev, context).filter(x => { val z = rand.nextDouble; lower <= z && z < upper }) } } } def randomSplit[Array[Double]](rdd: RDD[Array[Double]], numSplits: Int, trainingSize: Double, seed: Long): Iterator[(RDD[Array[Double]], RDD[Array[Double]])] = { require(0 < trainingSize && trainingSize < 1) val rg = new Random(seed) (1 to numSplits).map(_ => rg.nextInt).map(z => (new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, true), new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, false))).toIterator } table = load(“housePrices”) table.dropNA table.train.glm(“price”, “sf,zip,beds,baths”) What Happiness Looks Like
  • 10. we begin to dream / find solution to fight this evil
  • 12. HDFS Web Browser R-Studio Python Business Analyst Data Scientist Data Engineer API API API DDFClient PAClient PIClient Example:
  • 13. CLIENT WORKER WORKER WORKERWORKERMASTER Demo Deployment Diagram Airline Dataset, 12GB, 123M rows
  • 17. 2 Native R Data.frame Experience ddf.getSummary() ddf.getFiveNum() ddf.binning(“distance”, 10) ddf.scale(SCALE.MINMAX)
  • 21. 4 Seamlessly Integrate with External ML Libraries Data Algorithm Intelligence + =
  • 23. 5 Share DDFs Seamlessly & Efficiently Can I see your data? ddf://adatao/airline
  • 25. 6Work in Your Preferred Language …even Plain English!
  • 26. 1 2 3 4 5 6 Table-Like Abstraction on Top of Big Data Seamlessly Integrate with External ML Libraries Share DDFs Seamlessly & efficiently Native R Data.frame Experience Focus on Analytics, 
 not MR Work in Your Preferred Language
  • 27. as DDF is to Big Data SQL+R is to Small Data
  • 28. www.adatao.com/ddf DATA INTELLIGENCE FOR ALL We’re Open Sourcing DDF!!! 1. Accept Core Committers 2. Clean up & Prep 3. Open up for public access