Introduction to Hadoop Ecosystem

•

0 likes•1,732 views

Did you like it? Check out our blog to stay up to date: https://getindata.com/blog The presentation contents a fundamental knowlege about Hadoop Ecosystem. It includes a popular technology as HDFS, YARN, HIVE Spark and Flink

Technology

©
$ hdfs dfs -ls /user/tiger
$ hdfs dfs -put songs.txt /user/tiger
$ hdfs dfs -cat /user/tiger/songs.txt
$ hdfs dfs -mkdir songs
$ hdfs dfs -mv songs.txt songs
$ hdfs dfs -rmr songs

©
■
$ hdfs dfs -put songs.txt /user/tiger
Question?

©
■
●
●
■
Image source:
http://pixgood.com/slicing-bread.html

©
■
$ hdfs dfs -cat /user/tiger/songs.txt
Question?

©
1. Offers compute
resources such as CPU
and RAM
2. Runs tasks of the
applications submitted by
users
3. Reports to the Master

©
1. Knows about all Slaves
2. Knows about available and
occupied resources on each
Slave
3. Schedules jobs submitted by
clients

©
A user can submit
any type of
application that is
supported by YARN

©
1. Started and
overseen by
Resource
Manager
2. Coordinates the
execution of all
tasks within an
application
3. Asks for
resources
needed to run its
tasks
4. Runs on the Node
Manager

©
■
●
■
Containers are
dynamically
created and
deleted

©
■
Large volume
of data
Computation
e.g. a JAR file

©
1. NodeManagers
should be
collocated with
DataNodes
2. The Resource
Manager tries to
schedule tasks on
a node which is the
closest to the data
3. Large volumes of
data don’t have to
be sent over the
network

©
Their reality
■
■
●
■
Their conclusion
■

©
HADOOP
MR
MR
SOME
MAGIC
1. Parses query
2. Plans execution
3. Submits jobs
4. Monitors jobs
5. Returns results
Execution
SELECT trackid,
COUNT(*) AS cnt
FROM stream
GROUP BY trackid
ORDER BY cnt DESC;
Results

©
HADOOP
MR
MR
APACHE
HIVE
Results
1. Parses query
2. Plans execution
3. Submits jobs
4. Monitors jobs
5. Returns results
Execution
SELECT trackid,
COUNT(*) AS cnt
FROM stream
GROUP BY trackid
ORDER BY cnt DESC;

©
RDBMS
Hive
Metastore
Stores Hive
metadata
Manages metadata
about databases,
tables and views

©
Hive Shell CLI
BeesWax
HUE
RDBMS
Hive
Metastore
Acts as a proxy
for “ligth” clients
JDBC/ODBC
Hive Server 2
Beeline CLI

©
Job 1 Job 2
Possible to cache dataset
in cluster’s (distributed)
memory to read it faster
in next jobs
HDFS
Read
Memory
Read
Cache In
Memory
Cache In
Memory
Memory
Read

©
Job 1 Job 2
Great fit for
iterative algorithms
and interactive
queries!
HDFS
Read
Memory
Read
Cache In
Memory
Cache In
Memory
Possible to cache dataset
in cluster’s (distributed)
memory to read it faster
in next jobs
Memory
Read

©
Interactive queries
Iterative algorithms
Input
Query 2
Query 1
Query 3
Input Iteration 1 Iteration 2
Distributed
Memory

©
NodeManager
Client
YARN Container
Spark
Application
Master
Spark Driver
Resource Manager
NodeManager
YARN Container
Spark
Executor Spark Task
NodeManager
YARN Container
Spark
Executor
Spark Task

©
./bin/spark-submit --class org.apache.spark.examples.SparkPi
--master yarn
--deploy-mode cluster
--driver-memory 4g
--executor-memory 20g
--executor-cores 3
lib/spark-examples*.jar
10

©
■
Spark Core
Spark
SQL
Spark
Streaming
(near real-time,
micro-batch)
MLlib
(machine
learning)
GraphFrames
(graph
processing)
SparkR
(R on
Spark)

©
<-
INGEST
<-
STORE
<-
MANAGE
<-
ANALYZE

©
Non - stop
Each event
or
each minute
or
each user
session
Real-time
event
collection
Stream
processing

If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8 Proper data management and process scheduling are challenges that many data-driven companies under-prioritize. Although it might not cause troubles in short run, it becomes a nightmare when your cluster grows. However, even when you realize this problem, you might not see that possible solutions are so close... In this talk, we share how we simplified our data management and process scheduling in Hadoop with useful (but less adopted) open-source tools. We describe how Falcon, HCatalog, Avro, HDFS FsImage, CLI tools and tricks helped us to address typical problems related to orchestration of data pipelines and discovery, retention, lineage of datasets.

Introduction to Hive and HCatalog

markgrover

Hadoop Performance Optimization at Scale, Lessons Learned at Twitter

DataWorks Summit

Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug

Cmu-2011-09.pptxTed Dunning

MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab

CloudxLab

Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial: 1) Thinking in Map / Reduce 2) Understanding Unix Pipeline 3) Examples to understand MapReduce 4) Merging 5) Mappers & Reducers 6) Mapper Example 7) Input Split 8) mapper() & reducer() Code 9) Example - Count number of words in a file using MapReduce 10) Example - Compute Max Temperature using MapReduce 11) Hands-on - Count number of words in a file using MapReduce on CloudxLab

HBase with MapRTomer Shiran

If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8 An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData. Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume, its motivation, the most important features, architecture of Flume, its reliability guarantees, Agent's configuration, integration with the Apache Hadoop Ecosystem and more.

Hadoop hbase mapreduce

FARUK BERKSÖZ

Analyzing Real-World Data with Apache DrillTomer Shiran

Introduction to Apache Pig

Jason Shao

Apache Drill - Why, What, How

mcsrivas

Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003, and extends it to handle new formats like JSON, Parquet, ORC, and the usual CSV, TSV, XML and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data. It does so by able to handle schema-changes on the fly, enabling a whole new world of self-service and data agility never seen before.

HCatalog Hadoop Summit 2011

Hortonworks

Performance Hive+Tez 2t3rmin4t0r

The Hadoop Ecosystem

Mathias Herberts

Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab

CloudxLab

Big Data with Hadoop & Spark Training: http://bit.ly/2xkCd84 This CloudxLab Introduction to Hive tutorial helps you to understand Hive in detail. Below are the topics covered in this tutorial: 1) Hive Introduction 2) Why Do We Need Hive? 3) Hive - Components 4) Hive - Limitations 5) Hive - Data Types 6) Hive - Metastore 7) Hive - Warehouse 8) Accessing Hive using Command Line 9) Accessing Hive using Hue 10) Tables in Hive - Managed and External 11) Hive - Loading Data From Local Directory 12) Hive - Loading Data From HDFS 13) S3 Based External Tables in Hive 14) Hive - Select Statements 15) Hive - Aggregations 16) Saving Data in Hive 17) Hive Tables - DDL - ALTER 18) Partitions in Hive 19) Views in Hive 20) Load JSON Data 21) Sorting & Distributing - Order By, Sort By, Distribute By, Cluster By 22) Bucketing in Hive 23) Hive - ORC Files 24) Connecting to Tableau using Hive 25) Analyzing MovieLens Data using Hive 26) Hands-on demos on CloudxLab

Hive+Tez: A performance deep dive

t3rmin4t0r

Real-Time Data Loading from MySQL to Hadoop

Continuent

Hadoop is an increasingly popular means of analyzing transaction data from MySQL. Up until now mechanisms for moving data between MySQL and Hadoop have been rather limited. Continuent Tungsten Replicator provides enterprise-quality replication from MySQL to Hadoop under a GPL V2 license. Continuent Tungsten handles MySQL transaction types including INSERT/UPDATE/DELETE operations and can materialize binlogs as well as mirror-image data copies in Hadoop. Continuent Tungsten also has the high performance necessary to load data from busy source MySQL systems into Hadoop clusters with minimal load on source systems as well as Hadoop itself. This webinar covers the following topics: - How Hadoop works and why it's useful for processing transaction data from MySQL - Setting up Continuent Tungsten replication from MySQL to Hadoop - Transforming MySQL data within Hadoop to enable efficient analytics - Tuning replication to maximize performance. You do not need to be an expert in Hadoop or MySQL to benefit from this webinar. By the end listeners will have enough background knowledge to start setting up replication between MySQL and Hadoop using Continuent Tungsten. The software we are discussing is 100% open source and available from the Tungsten Replicator website at code.google.com.

Hive Anatomy

nzhang

Pptx presentNitish Bhardwaj

Hadoop Tutorial

awesomesos

An intriduction to hiveReza Ameri

Batch is Back: Critical for Agile Application Adoption

DataWorks Summit/Hadoop Summit

Quick Introduction to Apache Tez

GetInData

Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa

Streaming analytics better than batch when and why - (Big Data Tech 2017)

GetInData

If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8 While a lot of problems can be solved in batch, the stream processing approach currently gives you more benefits. And it’s not only sub-second latency at scale. But mainly possibility… to express accurate analytics with little effort – something that is hard or usually ignored with older batch technologies like Pig, Scalding, Spark or even established stream processors like Storm or Spark Streaming. In this talk we’ll use a real-world example of user session analytics to give you a use-case driven overview of business and technical problems that modern stream processing technologies like Flink help you solve, and benefits you can get by using them today for processing your data as a stream.

What's hot

Overview of Spark for HPC

Glenn K. Lockwood

Hadoop Architecture_Cluster_Cap_Plan

Narayana B

Hadoop MapReduce Streaming and Pipes

Hanborq Inc.

Apache Flume

GetInData

Hadoop hbase mapreduce

FARUK BERKSÖZ

Analyzing Real-World Data with Apache DrillTomer Shiran

Introduction to Apache Pig

Jason Shao

Apache Drill - Why, What, How

mcsrivas

HCatalog Hadoop Summit 2011

Hortonworks

Performance Hive+Tez 2t3rmin4t0r

The Hadoop Ecosystem

Mathias Herberts

Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab

CloudxLab

Hive+Tez: A performance deep dive

t3rmin4t0r

Real-Time Data Loading from MySQL to Hadoop

Continuent

Hive Anatomy

nzhang

Pptx presentNitish Bhardwaj

Hadoop Tutorial

awesomesos

An intriduction to hiveReza Ameri

Batch is Back: Critical for Agile Application Adoption

DataWorks Summit/Hadoop Summit

What's hot (19)

Overview of Spark for HPC

Hadoop Architecture_Cluster_Cap_Plan

Hadoop MapReduce Streaming and Pipes

Apache Flume

Hadoop hbase mapreduce

Analyzing Real-World Data with Apache Drill

Introduction to Apache Pig

Apache Drill - Why, What, How

HCatalog Hadoop Summit 2011

Performance Hive+Tez 2

The Hadoop Ecosystem

Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab

Hive+Tez: A performance deep dive

Real-Time Data Loading from MySQL to Hadoop

Hive Anatomy

Pptx present

Hadoop Tutorial

An intriduction to hive

Batch is Back: Critical for Agile Application Adoption

Viewers also liked

Quick Introduction to Apache Tez

GetInData

Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa

Streaming analytics better than batch when and why - (Big Data Tech 2017)

GetInData

Apache Hadoop In Theory And Practice

Adam Kawa

Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit

Building Hadoop Data Applications with Kite by Tom White

The Hive

With a such a large number of components in the Hadoop ecosystem, writing Hadoop applications can be a big challenge for newcomers. In this talk Tom looks at best practices for building data applications that run on Hadoop, and introduces the Kite SDK, an open source project created at Cloudera with the goal of simplifying Hadoop application development by codifying many of these best practices. Meet with Tom White: Tom White is one of the foremost experts on Hadoop. He has been an Apache Hadoop committer since February 2007, and is a Member of the Apache Software Foundation. Tom is a software engineer at Cloudera, where he has worked, since its foundation, on the core distributions from Apache and Cloudera. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O’Reilly, java.net and IBM’s developerWorks, and has spoken at many conferences, including ApacheCon and OSCON. Tom has a B.A. in mathematics from the University of Cambridge and an M.A. in philosophy of science from the University of Leeds, UK. He currently lives in Wales with his family.

Apache Hadoop YARN, NameNode HA, HDFS Federation

Adam Kawa

Apache Flume

Arinto Murdopo

Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)

Adam Kawa

At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of our most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it. Surprisingly or not, Apache Hadoop generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner we use … also Hadoop! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing, when to buy new nodes, how to calculate the empirical retention policy for each dataset, optimize the scheduler, benchmark the cluster, find its biggest offenders (both people and datasets) and more.

Apache Spark Overview

airisData

Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)

Adam Kawa

Link to video: http://www.youtube.com/watch?v=_GNbn4RzZcQ A typical day of a data engineer at Spotify revolves around Hadoop and music. However after some time of simultaneous developing MapReduce jobs, maintaining a cluster and listening to perfect music, something surprising might happen.. What? Well, a data engineer starts discovering Hadoop concepts in the lyrics of many songs! How can Coldplay, The Black Eyed Peas, Michael Jackson sing about Hadoop? (more at blog: http://hakunamapdata.com/hadoop-playlist-at-spotify/)

Apache Flume NG

huguk

Tune up Yarn and Hive

rxu

HDFS Federation

Hortonworks

Introduction To Elastic MapReduce at WHUG

Adam Kawa

Data Aggregation At Scale Using Apache Flume

Arvind Prabhakar

Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Steve Hoffman

Apache Avro and You

Eric Wendelin

functional dependencies with example

Siddhi Viradiya

Hadoop as data refinery

Steve Loughran

Viewers also liked (20)

Quick Introduction to Apache Tez

Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)

Streaming analytics better than batch when and why - (Big Data Tech 2017)

Apache Hadoop In Theory And Practice

Apache Tez - A New Chapter in Hadoop Data Processing

Building Hadoop Data Applications with Kite by Tom White

Apache Hadoop YARN, NameNode HA, HDFS Federation

Apache Flume

Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)

Apache Spark Overview

Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)

Apache Flume NG

Tune up Yarn and Hive

HDFS Federation

Introduction To Elastic MapReduce at WHUG

Data Aggregation At Scale Using Apache Flume

Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Apache Avro and You

functional dependencies with example

Hadoop as data refinery

Similar to Introduction to Hadoop Ecosystem

Gdb basics for my sql db as (percona live europe 2019)

Valerii Kravchuk

A Fast Intro to Fast Query with ClickHouse, by Robert Hodges

Altinity Ltd

Slides for the Webinar, presented on March 6, 2019 For the webinar video visit https://www.altinity.com/ Extracting business insight from massive pools of machine-generated data is the central analytic problem of the digital era. ClickHouse data warehouse addresses it with sub-second SQL query response on petabyte-scale data sets. In this talk we'll discuss the features that make ClickHouse increasingly popular, show you how to install it, and teach you enough about how ClickHouse works so you can try it out on real problems of your own. We'll have cool demos (of course) and gladly answer your questions at the end. Speaker Bio: Robert Hodges is CEO of Altinity, which offers enterprise support for ClickHouse. He has over three decades of experience in data management spanning 20 different DBMS types. ClickHouse is his current favorite. ;)

Practical Groovy DSLGuillaume Laforge

Orchestrating Big Data pipelines @ Fandom - Krystian Mistrzak Thejas Murthy

Evention

Fandom is the largest entertainment fan site in the world. With more than 360,000 fan communities and a global audience of over 190 million monthly uniques, we are the fan’s voice in entertainment. Being the largest entertainment site, wikia generates massive volumes of data, which varies from clickstream, user activities, api requests, ad delivery, A/B testing and much more. The big challenge is not just the volume but the orchestration involved in combining various sources of data with various periodicity, volumes. And Making sure the processed data is available for the consumers within the expected time. Thus helping gain the right insights well within the right time. A conscious decision was made to choose the right open source tool to solve the problem of orchestration, after evaluating various tools we decided to use Apache airflow. This presentation will give an overview of comparisons of existing tools and emphasize on why we choose airflow. And how Airflow is being used to create a stable reliable orchestration platform to enable non data engineers to seamlessly access data by democratizing data. We will focus on some tricks and best practises of developing workflows with Airflow and show how we are using some of the features of airflow.

Practical Domain-Specific Languages in GroovyGuillaume Laforge

Tales from the FieldMongoDB

The Kitchen Cloud How To: Automating Joyent SmartMachines with Chef

Chef Software, Inc.

Learning a new OS can be intimidating, especially one with less support in terms of open source Chef cookbooks. At Wanelo we’ve found the rewards of using Chef with Joyent’s SmartOS to be well worth the effort. SmartOS is an open source fork of Illumos (think Solaris) that runs in the Joyent Public Cloud. Over the last year we’ve grown to love SmartOS as a deployment environment, and with the help of Chef have grown Wanelo’s infrastructure more than ten times in six months to meet the demands our exponential user growth. In the next year, we expect to grow our infrastructure by another factor of ten. On another public cloud, our business growth would have required a significantly larger infrastructure at every step. In this session I’ll explain why we appreciate SmartOS so much and how you can get started. What’s the terminology? What plugins do you need, and how do you use them? What providers should you learn and where can you find them? I’ll provide bootstrap scripts, basic roles and cookbooks on Github to get people provisioning and using SmartMachines immediately. For larger infrastructures, I’ll walk through some of the dependencies that have made our lives easier, and explain why. By the end, you should have the code at your fingertips to deploy a Ruby or Rails application to the Joyent Public Cloud, with all of the dependent services up and running.

Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData

InfluxData

MySQL High Availability Sprint: Launch the Pacemaker

hastexo

Integrating ChatGPT with Apache Airflow

Tatiana Al-Chueyr

NetFlow Data processing using Hadoop and Vertica

Josef Niedermeier

Chef on SmartOSEric Saxby

Are your ready for in memory applications?

G2MCommunications

Screaming Fast Wpmudjcp

2014 hadoop wrocław jugWojciech Langiewicz

H2O on Hadoop Dec 12

Sri Ambati

Big data nyu

Edward Capriolo

Redis at LINE

LINE Corporation

Virtual training optimizing the tick stack

InfluxData

Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.

Similar to Introduction to Hadoop Ecosystem (20)

Gdb basics for my sql db as (percona live europe 2019)

A Fast Intro to Fast Query with ClickHouse, by Robert Hodges

Practical Groovy DSL

Orchestrating Big Data pipelines @ Fandom - Krystian Mistrzak Thejas Murthy

Practical Domain-Specific Languages in Groovy

Tales from the Field

The Kitchen Cloud How To: Automating Joyent SmartMachines with Chef

Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData

MySQL High Availability Sprint: Launch the Pacemaker

Integrating ChatGPT with Apache Airflow

NetFlow Data processing using Hadoop and Vertica

Chef on SmartOS

Are your ready for in memory applications?

Screaming Fast Wpmu

2014 hadoop wrocław jug

H2O on Hadoop Dec 12

Big data nyu

Redis at LINE

Virtual training optimizing the tick stack

Hw09 Production Deep Dive With High Availability

More from GetInData

Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf

GetInData

Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots. In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms. Why do we need yet another (open-source ) Copilot? How can we build one? Architecture and evaluation

How do we work with customers on Big Data / ML / Analytics Projects using Scr...

GetInData

How do we work with our customers ? How does it look? What do the meetings look like ? How do we structure the cooperation? Who does what and when ? We receive these kinds of questions quite often. They are very important questions as the customer should know the details before we start the project and it’s important for GetInData to be transparent on this so the client is well informed. During the webinar our Project Lead, Rafał Zalewski talked about Scrum Framework we use in cooperation with our customers. Watch here: https://www.youtube.com/watch?v=uOWrgcaKwWo&t=32s Speaker: Rafał Zalewski, GetInData: https://www.linkedin.com/in/rafalzalewski/ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz

GetInData

Watch video here: https://youtu.be/sfowpU90zFM Piotr's presentation about GetInData’s Data-Driven Fast Track, the 3-step framework for data transformation. You will learn: ➡ How to assess how data-driven your company is ➡ How to generate ideas for new initiatives to push your company towards better decisions ➡ How to think about implementing these initiatives to increase your chances of success If you miss it live don't despair. Watch the video and feel free to diagnose your company by filling out the survey prepared by our team here: https://bit.ly/3fKcRrb! After completing the survey, you will receive a tailored summary report with insights from one of our experts. Below you'll find links to all the materials mentioned in the workshop needed for exercises. LINKS TO MATERIALS ABOUT DATA-DRIVEN: Data-driven fast-track: 3 steps to make your company more data-driven: https://getindata.com/blog/data-drive... Is my company data-driven? Here’s how you can find out: https://getindata.com/blog/is-my-comp... If you: ➡ have questions about webinar topic, ➡ want to talk about your data-driven transformation, ➡ want to become more data-driven and you need consultations, don't hesitate to write to us: hello@getindata.com Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

How NOT to win a Kaggle competition

GetInData

If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8 Presentation from the performance given by Piotr Chaberski and Adrian Dembek the Data Science Summit ML Edition. Authors: Piotr Chaberski, Adrian Dembek Linkedin: https://www.linkedin.com/in/piotrchaberski/ https://www.linkedin.com/in/adriandembek/ ___ Company: Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

How to become good Developer in Scrum Team?

GetInData

Speaker: Rafał Zalewski, GetInData: https://www.linkedin.com/in/rafalzalewski/ Abstract: To become good Developer in Scrum Team you need to understand not only Scrum Events but also Scrum fundaments like Scrum Pillars and Scrum Values. In this presentation you will learn and understand the mindset expectation from you as Developer in Scrum Team. You will also learn how Scrum mindset helps to achieve better development results. ____ Company: Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

OpenLineage & Airflow - data lineage has never been easier

GetInData

If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8 Presentation from the performance given by Paweł during the Airflow Summit 2022. Author: Paweł Leszczyński Linkedin: https://www.linkedin.com/in/pawel-leszczynski/ ___ Company: Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Benefits of a Homemade ML Platform

GetInData

Did you like it? Check out our blog to stay up to date: https://getindata.com/blog Building your own platform is often ostracized these days. Everyone is encouraged to reuse existing solutions for known reasons. But using a ready-made platform / tool should not be a mindless process. Reusability is an art. During this presentation, you will learn why we decided to build our own MLOps platform while not re-inventing the wheel by using ready-made components with a touch of custom components. What are the benefits of this, but also what limitations and hurdles we have encountered. We hope that our experience will help you make the right decisions in your projects. Sometimes, maybe more risky ones.

Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData

GetInData

If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8 Presentation from the performance given by Mariusz during the Data Science Summit ML Edition. Author: Mariusz Strzelecki Linkedin: https://www.linkedin.com/in/mariusz-strzelecki/ ___ Company: Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...

GetInData

Did you like it? Check out our blog to stay up to date: https://getindata.com/blog This workshop focuses on creating a data streaming platform from scratch using an empty Kubernetes (or even Minikube) cluster. During the workshop, we go through the installation process, deploy the basic components for the platform, start Apache Flink, and monitor the process, using SQL to query available data. Author: Albert Lewandowski Linkedin: https://www.linkedin.com/in/albert-lewandowski/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

MLOps implemented - how we combine the cloud & open-source to boost data scie...

GetInData

Check out more about this presentation here: https://www.youtube.com/watch?v=nSsssYHiylQ&t=17s Presentation from the performance given by our team during the NSML Summit. Authors: Krzysztof Zarzycki, Marek Wiewiórka Linkedin: https://www.linkedin.com/in/kzarzycki/ https://www.linkedin.com/in/marekwiewiorka/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...

GetInData

Did you like it? Check out our E-book: Apache NiFi - A Complete Guide https://ebook.getindata.com/apache-nifi-complete-guide Apache NiFi is one of the most popular services for running ETL pipelines otherwise it’s not the youngest technology. During the talk, there are described all details about migrating pipelines from the old Hadoop platform to the Kubernetes, managing everything as the code, monitoring all corner cases of NiFi and making it a robust solution that is user-friendly even for non-programmers. Author: Albert Lewandowski Linkedin: https://www.linkedin.com/in/albert-lewandowski/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Feast + Amundsen Integration - Mariusz Strzelecki, GetInData

GetInData

Read more here: https://getindata.com/blog/machine-learning-features-discovery-feast-amundsen Author: Mariusz Strzelecki Linkedin: https://www.linkedin.com/in/mariusz-strzelecki/ ___ Company: Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Kubernetes and real-time analytics - how to connect these two worlds with Apa...

GetInData

Did you like it? Check out our blog to stay up to date: https://getindata.com/blog More and more services are running in Kubernetes so it means that we can migrate our current data pipelines to the new environment. In case of Flink we have multiple ways to do real-time data streaming: use Lyft or GCP operator, go with official deployment and customize it or choose the Ververica Platform or create something on your own. The presentation shows how to choose the right solution for technical requirements and business needs to run Flink in Kubernetes at great scale with no issues. Author: Albert Lewandowski Linkedin: https://www.linkedin.com/in/albert-lewandowski/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Big data trends - Krzysztof Zarzycki, GetInData

GetInData

If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8 Get more info here: https://getindata.com/blog/6-big-data-trends-2021-bigdata-blog/ Author: Krzysztof Zarzycki Linkedin: https://www.linkedin.com/in/kzarzycki/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...

GetInData

Did you like it? Check out our blog to stay up to date: https://getindata.com/blog The talk is focused on administration, development and monitoring platform with Apache Spark, Apache Flink and Kubeflow in which the monitoring stack is based on Prometheus stack. Author: Albert Lewandowski Linkedin: https://www.linkedin.com/in/albert-lewandowski/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...

GetInData

Check out more about this presentation here: https://www.youtube.com/watch?v=eqNToHn4yB0 The webinar was organized by GetinData on 2020. During the webinar we explaned what does it mean to build a data-driven company. Watch more here: https://www.youtube.com/watch?v=eqNToHn4yB0 Speaker: Rafał Małanij ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Monitoring in Big Data Platform - Albert Lewandowski, GetInData

GetInData

Did you like it? Check out our blog to stay up to date: https://getindata.com/blog The webinar was organized by GetinData on 2020. During the webinar we explaned the concept of monitoring and observability with focus on data analytics platforms. Watch more here: https://www.youtube.com/watch?v=qSOlEN5XBQc Whitepaper - Monitoring ang Observability for Data Platform: https://getindata.com/blog/white-paper-big-data-monitoring-observability-data-platform/ Speaker: Albert Lewandowski Linkedin: https://www.linkedin.com/in/albert-lewandowski/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Complex event processing platform handling millions of users - Krzysztof Zarz...

GetInData

If you want to learn more about it, check out our webinar here: https://www.youtube.com/watch?v=EfGPY_NyYQ8&t=77s The webinar was organized by GetinData on 2020. During the webinar, we shared our lessons learnt from building and running stream processing platform in production for over 2 years. Watch more here: https://www.youtube.com/watch?v=EfGPY_NyYQ8 Author: Krzysztof Zarzycki Linkedin: https://www.linkedin.com/in/kzarzycki/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Predicting Startup Market Trends based on the news and social media - Albert ...

GetInData

Did you like it? Check out our blog to stay up to date: https://getindata.com/blog Nowadays, one tweet can have impact on the value of the company or cryptocurrency. It becomes important for companies to be able to know everything what's happening in the market, especially for startups or when entering the new market. The presentation is about presenting the complex platform used for creating and verifying the strategy for a startup from the Wellbeing market. We go through web scraping-based data ingestion to ElasticSearch, NLP pipelines to understand what people write and what is the possible future of each market predicted by PySpark job. Author: Albert Lewandowski Linkedin: https://www.linkedin.com/in/albert-lewandowski/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Managing Big Data projects in a constantly changing environment - Rafał Zalew...

GetInData

Watch our full performance given by our team during the Big Data Technology Warsaw Summit: https://www.youtube.com/watch?v=CBrq7z8ikaM The nature of Big Data projects are nowadays one of its kind - they are not like the data warehousing initiatives in the old days, nor like cloud native applications projects, at least not yet. Variety of technologies, complicated architectures and rapidly changing landscape are just a few challenges that the IT Department is facing in such projects. When you add the number of stakeholders from different departments involved and that Big Data project is sometimes more like an R&D with unpredictable outcome, this makes a mix where the objectives can be easily lost. It is not a surprise that up to 85% of Big Data projects were pure failures (Gartner 2016). In this talk we will share our experience in planning and executing Big Data initiatives in the organisations, with some use cases and good practices in mind Watch our webinar here: https://www.youtube.com/watch?v=CBrq7z8ikaM Speakers: Rafał Małanij Rafał Zalewski Linkedin: https://www.linkedin.com/in/rafalzalewski/ ___ Company: Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

More from GetInData (20)

Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf

How do we work with customers on Big Data / ML / Analytics Projects using Scr...

Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz

How NOT to win a Kaggle competition

How to become good Developer in Scrum Team?

OpenLineage & Airflow - data lineage has never been easier

Benefits of a Homemade ML Platform

Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData

Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...

MLOps implemented - how we combine the cloud & open-source to boost data scie...

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...

Feast + Amundsen Integration - Mariusz Strzelecki, GetInData

Kubernetes and real-time analytics - how to connect these two worlds with Apa...

Big data trends - Krzysztof Zarzycki, GetInData

Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...

Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...

Monitoring in Big Data Platform - Albert Lewandowski, GetInData

Complex event processing platform handling millions of users - Krzysztof Zarz...

Predicting Startup Market Trends based on the news and social media - Albert ...

Managing Big Data projects in a constantly changing environment - Rafał Zalew...

Recently uploaded

GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...

Neo4j

Dr. Sean Tan, Head of Data Science, Changi Airport Group Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.

Mind map of terminologies used in context of Generative AI

Kumud Singh

Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI

Vladimir Iglovikov, Ph.D.

Presented by Vladimir Iglovikov: - https://www.linkedin.com/in/iglovikov/ - https://x.com/viglovikov - https://www.instagram.com/ternaus/ This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation. Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners. This case study covers various aspects, including: People: The contributors and community that have supported Albumentations. Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions. Challenges: The hurdles in monetizing open-source projects and measuring user engagement. Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration. Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community. Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations. Mental Health: Maintaining balance and not feeling pressured by user demands. Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth. Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects. Explore more about Albumentations and join the community at: GitHub: https://github.com/albumentations-team/albumentations Website: https://albumentations.ai/ LinkedIn: https://www.linkedin.com/company/100504475 Twitter: https://x.com/albumentations

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...

Neo4j

Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!

SOFTTECHHUB

As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.

DevOps and Testing slides at DASA Connect

Kari Kakkonen

By Design, not by Accident - Agile Venture Bolzano 2024

Pierluigi Pugliese

National Security Agency - NSA mobile device best practices

Quotidiano Piemontese

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024

Albert Hoitingh

GraphRAG is All You need? LLM & Knowledge Graph

Guy Korland

Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs. 1. Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://arxiv.org/abs/2306.08302 2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs: https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs

Alex Pruden

This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second). Paper: https://eprint.iacr.org/2023/1886

RESUME BUILDER APPLICATION Project for students

KAMESHS29

20240607 QFM018 Elixir Reading List May 2024

Matthew Sinclair

Essentials of Automations: The Art of Triggers and Actions in FME

Safe Software

In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation. We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios. Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!

Communications Mining Series - Zero to Hero - Session 1

DianaGray10

This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered: • Communication Mining Overview • Why is it important? • How can it help today’s business and the benefits • Phases in Communication Mining • Demo on Platform overview • Q/A

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...

DanBrown980551

Do you want to learn how to model and simulate an electrical network from scratch in under an hour? Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)! During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook. PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides: - A fully editable and extendable library for grid component modelling; - Visualization tools to display your network; - Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses; The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well. What you will learn during the webinar: - For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills; - For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024

Neo4j

GridMate - End to end testing is a critical piece to ensure quality and avoid...

ThomasParaiso2

Microsoft - Power Platform_G.Aspiotis.pdf

Uni Systems S.M.S.A.

Epistemic Interaction - tuning interfaces to provide information for AI support

Alan Dix

Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024 https://alandix.com/academic/papers/synergy2024-epistemic/ As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.

Recently uploaded (20)

GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...

Mind map of terminologies used in context of Generative AI

Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!

DevOps and Testing slides at DASA Connect

By Design, not by Accident - Agile Venture Bolzano 2024

National Security Agency - NSA mobile device best practices

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024

GraphRAG is All You need? LLM & Knowledge Graph

zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs

RESUME BUILDER APPLICATION Project for students

20240607 QFM018 Elixir Reading List May 2024

Essentials of Automations: The Art of Triggers and Actions in FME

Communications Mining Series - Zero to Hero - Session 1

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024

GridMate - End to end testing is a critical piece to ensure quality and avoid...

Microsoft - Power Platform_G.Aspiotis.pdf

Epistemic Interaction - tuning interfaces to provide information for AI support

Introduction to Hadoop Ecosystem

2. © ■ ■ ● ●

5. © ■ ● ● ●

6. ©

7. ©

8. © ■ ■ ■

11. © ■

13. © $ hdfs dfs -ls /user/tiger $ hdfs dfs -put songs.txt /user/tiger $ hdfs dfs -cat /user/tiger/songs.txt $ hdfs dfs -mkdir songs $ hdfs dfs -mv songs.txt songs $ hdfs dfs -rmr songs

19. © ■ ■

22. ©

23. ©

24. © ■ ■ ■ ■

25. © 1. Offers compute resources such as CPU and RAM 2. Runs tasks of the applications submitted by users 3. Reports to the Master

26. © 1. Knows about all Slaves 2. Knows about available and occupied resources on each Slave 3. Schedules jobs submitted by clients

28. © 1. Started and overseen by Resource Manager 2. Coordinates the execution of all tasks within an application 3. Asks for resources needed to run its tasks 4. Runs on the Node Manager

30. © ■ ■ ● ■

33. © 1. NodeManagers should be collocated with DataNodes 2. The Resource Manager tries to schedule tasks on a node which is the closest to the data 3. Large volumes of data don’t have to be sent over the network

34. © ■

36. © HADOOP MR MR SOME MAGIC 1. Parses query 2. Plans execution 3. Submits jobs 4. Monitors jobs 5. Returns results Execution SELECT trackid, COUNT(*) AS cnt FROM stream GROUP BY trackid ORDER BY cnt DESC; Results

37. © HADOOP MR MR APACHE HIVE Results 1. Parses query 2. Plans execution 3. Submits jobs 4. Monitors jobs 5. Returns results Execution SELECT trackid, COUNT(*) AS cnt FROM stream GROUP BY trackid ORDER BY cnt DESC;

38. ©

39. ©

40. ©

41. ©

45. © Hive Shell CLI BeesWax HUE RDBMS Hive Metastore Acts as a proxy for “ligth” clients JDBC/ODBC Hive Server 2 Beeline CLI

46. ©

47. © ■ ■

48. ©

49. © Job 1 Job 2 Possible to cache dataset in cluster’s (distributed) memory to read it faster in next jobs HDFS Read Memory Read Cache In Memory Cache In Memory Memory Read

50. © Job 1 Job 2 Great fit for iterative algorithms and interactive queries! HDFS Read Memory Read Cache In Memory Cache In Memory Possible to cache dataset in cluster’s (distributed) memory to read it faster in next jobs Memory Read

51. © Interactive queries Iterative algorithms Input Query 2 Query 1 Query 3 Input Iteration 1 Iteration 2 Distributed Memory

52. © NodeManager Client YARN Container Spark Application Master Spark Driver Resource Manager NodeManager YARN Container Spark Executor Spark Task NodeManager YARN Container Spark Executor Spark Task

53. © ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 20g --executor-cores 3 lib/spark-examples*.jar 10

54. © ■ Spark Core Spark SQL Spark Streaming (near real-time, micro-batch) MLlib (machine learning) GraphFrames (graph processing) SparkR (R on Spark)

Introduction to Hadoop Ecosystem

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Viewers also liked

Viewers also liked (20)

Similar to Introduction to Hadoop Ecosystem

Similar to Introduction to Hadoop Ecosystem (20)

More from GetInData

More from GetInData (20)

Recently uploaded

Recently uploaded (20)

Introduction to Hadoop Ecosystem