The Apache Ignite Platform
Apache Ignite is a memory-centric data platform that is used to build fast, scalable & resilient solutions.
At the heart of the Apache Ignite platform lies a distributed, memory-centric data store with ACID semantics and powerful processing APIs, including SQL, compute, key/value, and transactions. The memory-centric design lets Apache Ignite leverage memory for high throughput and low latency whilst utilizing local disk or SSD to provide durability and fast recovery.
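The key/value and transaction APIs mentioned above can be sketched as follows. This is a minimal, hedged example: the cache name "accounts" and the stored values are illustrative assumptions, and it requires `ignite-core` on the classpath.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;

public class KeyValueExample {
    public static void main(String[] args) {
        // Starts a local Ignite node with default configuration.
        try (Ignite ignite = Ignition.start()) {
            // Transactions require a TRANSACTIONAL cache; the default mode is ATOMIC.
            CacheConfiguration<Integer, String> ccfg =
                new CacheConfiguration<Integer, String>("accounts")
                    .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);

            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(ccfg);

            cache.put(1, "alice");

            // Group several operations into a single ACID transaction.
            try (Transaction tx = ignite.transactions().txStart()) {
                cache.put(2, cache.get(1) + "-backup");
                tx.commit();
            }

            System.out.println(cache.get(2));
        }
    }
}
```

The same cache is also queryable through SQL when query entities are configured, which is how the key/value and SQL APIs operate over a single copy of the data.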
The main difference between the memory-centric approach and the traditional disk-centric approach is that memory is treated as fully functional storage, not merely as a caching layer in front of disk, as in most databases. For example, Apache Ignite can function in a pure in-memory mode, in which case it can be treated as an In-Memory Database (IMDB) and In-Memory Data Grid (IMDG) in one.
When persistence is turned on, on the other hand, Ignite functions as a memory-centric system in which most of the processing happens in memory while the data and indexes are persisted to disk. The main difference here from a traditional disk-centric RDBMS or NoSQL system is that Ignite is strongly consistent, horizontally scalable, and supports both SQL and key-value processing APIs.
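Turning persistence on is a configuration choice. A hedged sketch of how it might look, assuming Ignite 2.3+ where `DataStorageConfiguration` is available:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PersistenceConfigExample {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Persist data and indexes of the default data region to disk;
        // memory remains the primary operational storage.
        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg);

        Ignite ignite = Ignition.start(cfg);

        // With persistence enabled, a new cluster starts inactive and must
        // be activated once before caches can be used.
        ignite.cluster().active(true);
    }
}
```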
The Apache Ignite platform can be integrated with third-party databases and external storage media and can be deployed on any infrastructure. It provides linear scalability, built-in fault tolerance, and comprehensive security and auditing, alongside advanced monitoring and management.
The Apache Ignite platform caters to a range of use cases, including core banking services, real-time product pricing, reconciliation and risk-calculation engines, analytics, and machine learning.
Apache Ignite provides an implementation of the Spark RDD abstraction that allows state to be easily shared in memory across Spark jobs. The main difference between a native Spark RDD and IgniteRDD is that IgniteRDD provides a shared in-memory view of data across different Spark jobs, workers, or applications, while a native Spark RDD cannot be seen by other Spark jobs or applications.
IgniteRDD is implemented as a view over a distributed Ignite cache, which may be deployed within the Spark job's executing process, on a Spark worker, or in its own cluster. Depending on the chosen deployment mode, the shared state may exist only during the lifespan of a Spark application (embedded mode), or it may outlive the Spark application (standalone mode), in which case the state can be shared across multiple Spark applications.
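The standalone deployment described above can be sketched from the Java API. The cache name and configuration file path below are illustrative assumptions, and the example requires `ignite-spark` and Spark on the classpath:

```java
import java.util.Arrays;

import org.apache.ignite.spark.JavaIgniteContext;
import org.apache.ignite.spark.JavaIgniteRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SharedRddExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("shared-rdd").setMaster("local"));

        // Standalone mode: the cache lives in an external Ignite cluster
        // (described by ignite-config.xml), so the shared state outlives
        // this Spark application.
        JavaIgniteContext<Integer, Integer> ic =
            new JavaIgniteContext<>(sc, "ignite-config.xml");

        JavaIgniteRDD<Integer, Integer> sharedRdd = ic.fromCache("sharedNumbers");

        // Pairs written here become visible to any other Spark job (or plain
        // Ignite client) that opens the same cache.
        sharedRdd.savePairs(
            sc.parallelize(Arrays.asList(1, 2, 3))
              .mapToPair(i -> new Tuple2<>(i, i * i)));

        sc.stop();
    }
}
```

A second Spark application pointing `fromCache("sharedNumbers")` at the same cluster would read the pairs written here, which is the shared-state behavior a native Spark RDD cannot provide.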
DEMO: run several ML samples from the standard distribution.
Main benefits:
No ETL – online “in place” ML
In-memory speed & scale
Large-scale parallelization
Optimized ML/DL algorithms
Last-mile GPU optimization
The rationale for building ML Grid is quite simple. Many users employ Ignite as the central high-performance storage and processing system for various data sets. If they wanted to perform ML or Deep Learning (DL) on these data sets (e.g. for model training or inference), they previously had to ETL them into other systems such as Apache Mahout or Apache Spark.
The roadmap for ML Grid starts with a core algebra implementation based on Ignite's co-located distributed processing. The initial version was released with Ignite 2.0. Future releases will introduce custom DSLs for Python, R, and Scala; a growing collection of optimized ML algorithms such as Linear and Logistic Regression, Decision Tree/Random Forest, SVM, and Naive Bayes; as well as support for Ignite-optimized Neural Networks and integration with TensorFlow.
The current beta version of Apache Ignite Machine Learning Grid (ML Grid) provides a distributed machine learning library built on top of the highly optimized and scalable Apache Ignite platform. It implements local and distributed vector and matrix algebra operations as well as distributed versions of widely used algorithms.
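A hedged sketch of the local vector algebra operations in the ML Grid beta; the class names below come from the `org.apache.ignite.ml.math` package in early Ignite 2.x releases and should be treated as assumptions against your Ignite version.

```java
import org.apache.ignite.ml.math.Vector;
import org.apache.ignite.ml.math.impls.vector.DenseLocalOnHeapVector;

public class VectorAlgebraExample {
    public static void main(String[] args) {
        Vector a = new DenseLocalOnHeapVector(new double[] {1.0, 2.0, 3.0});
        Vector b = new DenseLocalOnHeapVector(new double[] {4.0, 5.0, 6.0});

        Vector sum = a.plus(b);  // element-wise addition
        double dot = a.dot(b);   // 1*4 + 2*5 + 3*6 = 32.0

        System.out.println("sum=" + sum + " dot=" + dot);
    }
}
```

The distributed counterparts follow the same algebra interfaces but partition the data across the cluster, so the operations run co-located with the data rather than after an ETL step.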