
Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame


*This talk was first presented at http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/225673273/*

Enterprise users today demand the ability to glean insights from their disparate data spread across varied transactional and analytics sources; hence, analytics application developers need the ability to connect to varied data & compute engines such as Spark, Flink, Cassandra, etc.

A key pain point for developers is the lack of a uniform API across data & compute engines, a limitation which adversely impacts developer productivity, while also restricting dataflow across different engines. DDF (Distributed DataFrame) is a simple but powerful API above and across multiple engines. Using DDF, developers reap significant benefits including (1) a unified and highly productive API for data/compute access, (2) the ability to process data at-source, bypassing the absolute requirement for a Hadoop data lake, and (3) future-proofing against rapidly shifting economics of specific data engines.
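
The idea of one API sitting above several engines can be sketched in plain Java. This is a minimal illustration only, not the DDF library: `Frame`, `Backend`, `Manager`, and `inMemoryBackend` are hypothetical names invented here, and the "engines" are toy in-memory stand-ins. It shows application logic written once against an engine-neutral interface while the engine is selected by name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UnifiedApiSketch {
    /** Hypothetical engine-neutral table abstraction (not the real DDF interface). */
    public interface Frame {
        String engine();
        long rowCount();
    }

    /** Hypothetical per-engine binding that knows how to load a table. */
    public interface Backend {
        Frame load(String tableName);
    }

    /** Toy in-memory backend standing in for a real engine binding. */
    public static Backend inMemoryBackend(String engineName, Map<String, List<String>> tables) {
        return tableName -> {
            List<String> rows = tables.get(tableName);
            if (rows == null) {
                throw new IllegalArgumentException("unknown table: " + tableName);
            }
            return new Frame() {
                @Override public String engine() { return engineName; }
                @Override public long rowCount() { return rows.size(); }
            };
        };
    }

    /** Registry that selects a backend by engine name. */
    public static final class Manager {
        private final Map<String, Backend> backends = new HashMap<>();

        public void register(String name, Backend backend) {
            backends.put(name, backend);
        }

        public Backend get(String name) {
            Backend backend = backends.get(name);
            if (backend == null) {
                throw new IllegalArgumentException("no backend registered for: " + name);
            }
            return backend;
        }
    }

    /** Application logic written once, against Frame only. */
    public static String describe(Frame frame) {
        return frame.rowCount() + " rows via " + frame.engine();
    }

    public static void main(String[] args) {
        Map<String, List<String>> tables = new HashMap<>();
        tables.put("airline", Arrays.asList("row1", "row2", "row3"));

        Manager manager = new Manager();
        manager.register("flink", inMemoryBackend("flink", tables));
        manager.register("spark", inMemoryBackend("spark", tables));

        // Same call sequence, different engine name.
        System.out.println(describe(manager.get("flink").load("airline")));
        System.out.println(describe(manager.get("spark").load("airline")));
    }
}
```

Switching engines changes only the string passed to `get`, which is the portability property the abstract attributes to DDF.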

To date, DDF has been implemented on Spark, Flink, and other engines. In this talk we demonstrate, for the first time, a business-analyst-friendly real-time data exploration and visualization system working directly with Flink. We will show how a business user can ask natural-language questions of their data and get real-time answers from Flink in the form of charts and tables. We’ll also show interaction with the DDF-on-Flink API at the developer level, share the challenges and lessons learned in realizing this vision on Flink, and compare that experience with the same work on Spark.

Speakers:
Christopher Nguyen, Co-Founder and CEO, Arimo
Rohit Rai, Founder and CEO of Tuplejump

Published in: Technology

  1. Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame
     Bringing BigApps to Flink
     Christopher Nguyen, PhD, CEO & Co-Founder, Arimo
     Rohit Rai, CEO, Tuplejump
     @arimoinc @pentagoniac http://ddf.io
  2. Who Are We? What Do We Do?
  3. What Are Adatao Big Apps?
     § Predictive: Predictive Analytics for Business Users
     § Collaborative: Real-time Collaboration with Data Scientists
  4. Demo
  5. The EXPLOSION of Data & Compute engines: The CIO Challenge
     [Diagram: Scala, Java, Python, and R clients, each connected separately to Ignite, HDFS, S3, Redshift, BigQuery, Cassandra, RDBMS, Spark, Flink, and Presto]
  6. The Solution: DDF Data Integration
     [Diagram: Scala, Java, Python, and R on top of DDF, with a DDF layer over each of Spark, Flink, Ignite, Presto, HDFS, S3, Redshift, BigQuery, Cassandra, and RDBMS. Diagram labels: Data in Memory, Data at Rest, DWs, DBs, Enterprise Data Bus]
  7. Benefits of DDF Data Integration
     § FOR DATA ENGINEERS
     § Unified API across data sources and engines
     § HDFS, S3, Cassandra, Redshift, BigQuery, RDBMS, Salesforce, Spark, Flink, Ignite …
     § FOR DATA SCIENTISTS
     § Uniform high-level DataFrame abstractions: ETL, ML, Streaming
  8. Arimo Predictive Intelligence Platform
     [Diagram labels: Custom Apps, Adatao AppBuilder, Adatao PredictiveEngine; Big Compute, Big Data, Big Apps; Distributed DataFrame (DDF), Open Sourced; Data Scientist, Business User, Data Engineer]
  9. Why Flink?
     § Emerging engine with unique strengths (e.g., streaming)
     § Driven by Customer & Partner conversations
  10. Demo
  11. DDF: “Under the Hood”
     [Diagram: Java, Python, and R on top of DDF, with DDF over Spark, Flink, and Redshift. Spark APIs: RDD, DataFrame, DStream, … Flink APIs: DataSet, Table, DataStream, … Unified DDF APIs: ETL Interfaces, ML Interfaces, Streaming Interfaces]
  12. DDF API in a Nutshell
        // To start working with an engine
        DDFManager manager = DDFManager.get("flink"); // or "spark"
        // Then, data can be loaded into a DDF as follows:
        DDF table = manager.sql2ddf("select * from airline");
        // ETL, transform
        table = table.transform("dist = round(distance/2, 2)");
        // Run machine learning using MLlib, then run prediction
        KMeansModel kmeansModel = (KMeansModel) table.ML.train("kmeans", 5, 5).getRawModel();
        Int prediction = table.ML.applyModel(kmeansModel, false, true);
  13. Demo
  14. Lessons Learned
     § It was easy for us to implement DDF on Flink
     § Flink API close to functional collection API
  15. Lessons Learned
     § With DDF, it’s easy to port applications on DDF from one engine to another
  16. Lessons Learned
     § There’s now an opportunity to use Flink for interactive applications
     § Backtracking scheduler, session management, better graph analysis
  17. Lessons Learned
     § Null/missing value handling in Flink
     § Null value support needed in RowSerializer
  18. Lessons Learned
     § Map vs MapPartitions vs Accumulators
     § Map for aggregations can cause a lot of object creation overhead
     § Accumulators may fail for huge datasets
  19. Lessons Learned
     § Use caution when doing array copy overs in Table API
  20. DDF: Where is it heading?
     § More Engines: DBs & DWs: BigQuery, Cassandra, Teradata, Presto, Ignite
     § Enterprise Databus to seamlessly move data across sources
     § Richer APIs
  21. Get Started with DDF
     § Increase your productivity & build engine-agnostic apps
       • Build your analytics apps on existing modules
       • Flink, Spark, JDBC
     § Expand possibilities. Contribute to DDF
       • Enrich existing plugins: Data APIs, ML APIs...
       • Add new DDF plugins:
         • BigQuery, Cassandra
         • Marketo
         • Ignite, Presto
     § Spread the word! www.ddf.io/gettingstarted
  22. Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame
     Bringing BigApps to Flink
     Christopher Nguyen, PhD, CEO & Co-Founder, Arimo
     Rohit Rai, CEO, Tuplejump
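
The "Map vs MapPartitions" lesson about object creation overhead can be illustrated without any engine at all. The sketch below is plain Java, not Flink code: partitions are simulated as nested lists, and `SumCount` with its allocation counter is a hypothetical helper added purely to make the difference visible. It assumes the aggregation is a mean, which the slides do not specify.

```java
import java.util.Arrays;
import java.util.List;

final class PartitionAggregationSketch {
    /** Mutable running sum and count; one instance can absorb many records. */
    public static final class SumCount {
        /** Instrumentation for this sketch only: counts constructed instances. */
        public static long allocations = 0;

        double sum;
        long count;

        public SumCount() { allocations++; }

        public SumCount add(double value) { sum += value; count++; return this; }

        public SumCount merge(SumCount other) {
            sum += other.sum;
            count += other.count;
            return this;
        }

        public double mean() { return count == 0 ? Double.NaN : sum / count; }
    }

    /** Map-style: wrap every record in its own object, then reduce. */
    public static SumCount perRecord(List<List<Double>> partitions) {
        SumCount total = new SumCount();
        for (List<Double> partition : partitions) {
            for (double value : partition) {
                total.merge(new SumCount().add(value)); // one allocation per record
            }
        }
        return total;
    }

    /** MapPartitions-style: one accumulator per partition, then merge the partials. */
    public static SumCount perPartition(List<List<Double>> partitions) {
        SumCount total = new SumCount();
        for (List<Double> partition : partitions) {
            SumCount partial = new SumCount();          // one allocation per partition
            for (double value : partition) {
                partial.add(value);
            }
            total.merge(partial);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0),
                Arrays.asList(4.0, 5.0, 6.0));

        SumCount.allocations = 0;
        double meanA = perRecord(partitions).mean();
        long allocationsA = SumCount.allocations;

        SumCount.allocations = 0;
        double meanB = perPartition(partitions).mean();
        long allocationsB = SumCount.allocations;

        System.out.println("per-record:    mean=" + meanA + " allocations=" + allocationsA);
        System.out.println("per-partition: mean=" + meanB + " allocations=" + allocationsB);
    }
}
```

Both strategies compute the same mean. The per-record version allocates in proportion to the number of records, while the per-partition version allocates in proportion to the number of partitions, which is the kind of overhead slide 18 warns about.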
