The main idea of a Data Lake is to expose company data to people within the company in an agile and flexible way, while preserving the safeguarding and auditing features required for the company’s critical data. Most projects in this direction start out by depositing all of the data in Hadoop, inferring a schema on top of the data, and then using the data for analytics via Hive or Spark. This stack is a good approach for many use cases, as it provides cheap file-based storage and rich analytics on top. But many pitfalls and problems can show up along the way, most of which can be met by extending the toolset. The potential bottlenecks surface as soon as users arrive and start exploiting the Lake. For all of these reasons, planning and building a Data Lake within an organization requires a strategic approach, in order to build an architecture that can support it.
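To make the described pattern concrete, here is a minimal PySpark sketch of the schema-on-read workflow the abstract refers to: raw files are landed in Hadoop, the schema is inferred at read time, and analysts query the data with Spark SQL. The HDFS path and column names are illustrative assumptions, not details from the talk.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Schema is inferred from the raw files at read time ("schema on read").
events = spark.read.json("hdfs:///datalake/raw/events/")  # illustrative path
events.printSchema()

# Expose the raw data to analysts through plain SQL (Hive works similarly).
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT user_id, COUNT(*) AS actions
    FROM events
    GROUP BY user_id
    ORDER BY actions DESC
    LIMIT 10
""").show()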
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to complement them. In this presentation you will hear what Big Data and the Data Lake are, and which technologies are most popular in the Big Data world. We will also speak about Hadoop and Spark, how they integrate with traditional systems, and what their benefits are.
Speaker: Geetha Balasundaram, Developer at ThoughtWorks
From tools and technology to people and requirements, what's different in the data engineering space? App development is traditional by now. All enterprises want to become data-guided. A data lake is a good start, yet there is a great deal of know-how and do-how involved.
Drawing on experiences from building a data lake in the retail domain, the talk will cover:
- What this vast new space of data engineering is
- Why it is critical to think in terms of data rather than features
- How important it is to understand these technologies and create a data lake that is usable and insightful to the business
Richard Vermillion, CEO of After, Inc. and Fulcrum Analytics, Inc., discusses data lakes and their value in supporting the warranty and extended service plan chain.
A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account or file size. It stores data at high volume to increase analytic performance and enable native integration.
A Data Lake is like a large container, much as a real lake is fed by rivers. Just as a lake has multiple tributaries coming in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing in, often in real time.
Building the Enterprise Data Lake: A Look at Architecture - Mark Madsen
The topic is building an Enterprise Data Lake, discussing high-level data and technology architecture. We will describe the architecture of a data warehouse, explain how a data lake needs to differ, and show a high-level functional and data architecture for a data lake. This webinar will cover:
Why dumping data into Hadoop and letting users get it out doesn't work
The difference between a Hadoop application and a Data Lake
Why new ideas about data architecture are a key element
An Enterprise Data Lake reference architecture to frame what must be built
Big Data Architectures and the Data Lake - James Serra
With so many new technologies, it can get confusing which approach to building a big data architecture is best. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I'll discuss the four most common patterns in big data production implementations, the top-down vs. bottom-up approach to analytics, and how you can use a data lake and an RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and on how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
Data Lakes are early in the Gartner hype cycle, but companies are getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your big data lakes.
This white paper presents the opportunities opened up by the data lake and advanced analytics, as well as the challenges in integrating, mining, and analyzing the data collected from these sources. It goes over the important characteristics of the data lake architecture and the Data and Analytics as a Service (DAaaS) model. It also delves into the features of a successful data lake and its optimal design, and covers how data, applications, and analytics are strung together to speed up the insight-brewing process, with the data lake as a powerful architecture for mining and analyzing unstructured data.
Incorporating the Data Lake into Your Analytic Architecture - Caserta
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Traditional data storage and analytic tools no longer provide the agility and flexibility required to deliver relevant business insights. That’s why organizations are shifting to a data lake architecture. This approach allows you to store massive amounts of data in a central location so it's readily available to be categorized, processed, analyzed, and consumed by diverse organizational groups. In this session, we’ll assemble a data lake using services such as Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue.
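As a rough illustration of how two of those services fit together, here is a hedged boto3 sketch that lands a raw file in S3 and runs an Athena query over it. Bucket, database, and table names are made-up assumptions, and the Kinesis, EMR, and Glue pieces of the session are omitted.

import boto3

# Lake storage: one bucket for raw data and query results (us-east-1 assumed;
# other regions need a CreateBucketConfiguration argument).
s3 = boto3.client("s3")
s3.create_bucket(Bucket="my-data-lake-demo")
s3.upload_file("events.json", "my-data-lake-demo", "raw/events.json")

# Query in place with Athena; assumes a "raw_events" table was already
# defined over the S3 data (e.g. by a Glue crawler).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM raw_events",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-demo/athena-results/"},
)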
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra... - Data Con LA
Why and how the Big Data based Enterprise Data Lake solution, built on NoSQL and SQL technologies, has become significantly more effective at solving enterprise data challenges than its predecessor, the EDW, which had tried and failed to solve the same problem based entirely on SQL databases.
Data Lakes are meant to support many of the same analytics capabilities as Data Warehouses while overcoming some of their core problems. Yet Data Lakes have a distinctly different technology base. This webinar will provide an overview of the standard architecture components of Data Lakes.
This will include:
The Lab and the factory
The base environment for batch analytics
Critical governance components
Additional components necessary for real-time analytics and ingesting streaming data
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... - NoSQLmatters
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
Introduction to Microsoft’s Hadoop Solution (HDInsight) - James Serra
Did you know Microsoft provides a Hadoop Platform-as-a-Service (PaaS)? It’s called Azure HDInsight and it deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution that includes many Hadoop components such as HBase, Spark, Storm, Pig, Hive, and Mahout. Join me in this presentation as I talk about what Hadoop is, why deploy to the cloud, and Microsoft’s solution.
The Data Lake and Getting Businesses the Big Data Insights They Need - Dunn Solutions Group
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can be a task to keep up with and clearly understand each of them. However, a data lake is definitely something worth dedicating the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location, ready for consumption at any time - this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don’t hesitate to visit us online at: http://bit.ly/2fvV5rR
Big Data international keynote speaker Mark van Rijmenam shared his vision on Hadoop Data Lakes during a Zaloni webinar: what are the Hadoop Data Lake trends for 2016, what are the data lake challenges, and how can organizations benefit from data lakes?
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat... - DataWorks Summit
The Finance Data Lake's objective is to create a centralized enterprise data repository for all Finance and Supply Chain data. It serves as the single source of truth. It enables a self-service discovery analytics platform for business users to answer ad hoc business questions and derive critical insights. The data lake is based on the open source Hadoop big data platform and is a very cost effective solution for breaking the ERP data silos and simplifying the data architecture in the enterprise.
POCs were conducted on an in-house Hortonworks Hadoop data platform to validate cluster performance for production volumes. Based on business priorities, an initial roadmap was defined using 3 data sources: 2 SAP ERPs and PeopleSoft (OLTP systems). A development environment was established in the AWS Cloud for agile delivery. The near real time data ingestion architecture for the data lake was defined using replication tools and a custom Sqoop-based micro-batching framework, with data persisted in Apache Hive in ORC format. Data and user security are implemented using Apache Ranger, and sensitive data is stored at rest in encryption zones. Business data sets were developed as Hive scripts and scheduled using Oozie. Connectivity for multiple reporting tools, including SQL tools, Excel, and Tableau, was enabled for self-service analytics. Upon successful implementation of the initial phase, a full roadmap was established to extend the Finance data lake to over 25 data sources, scale up data ingestion, and enable OLAP tools on Hadoop.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... - Hortonworks
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels, including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and converts all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
Designing a Real Time Data Ingestion Pipeline - DataScience
In this presentation, Badar discusses DataScience Engineering’s process for designing a real time data ingestion pipeline, facing different technical challenges and changing business needs along the way. Badar will also discuss how big data technologies like Kafka/Kinesis, Spark/Hadoop, and Cassandra/DynamoDB help solve problems in high throughput data ingestion and processing.
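As a toy illustration of the consumption end of such a pipeline, the sketch below uses the kafka-python client to read JSON events from a topic and keep a running count per event type. The topic name, broker address, and message shape are assumptions, not details from the talk; Kinesis could play the same role.

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                                  # illustrative topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

counts = {}
for message in consumer:                       # blocks, consuming records as they arrive
    event = message.value
    key = event.get("type", "unknown")
    counts[key] = counts.get(key, 0) + 1       # trivial stand-in for real processing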
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa... - Zaloni
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma, thought leader and coauthor of Architecting Data Lakes, offers lessons learned from the field to get you started.
10 Amazing Things To Do With a Hadoop-Based Data Lake - VMware Tanzu
Greg Chase, Director, Product Marketing, presents Big Data: 10 Amazing Things to do With A Hadoop-based Data Lake at the Strata Conference + Hadoop World 2014 in NYC.
Building the Data Lake with Azure Data Factory and Data Lake Analytics - Khalid Salama
In essence, a data lake is a commodity distributed file system that acts as a repository to hold raw data file extracts of all the enterprise source systems, so that it can serve the data management and analytics needs of the business. A data lake system provides means to ingest data, perform scalable big data processing, and serve information, in addition to managing, monitoring, and securing the environment. In these slides, we discuss building data lakes using Azure Data Factory and Data Lake Analytics. We delve into the architecture of the data lake and explore its various components. We also describe the various data ingestion scenarios and considerations. We introduce the Azure Data Lake Store, then discuss how to build an Azure Data Factory pipeline to ingest data into the data lake. After that, we move into big data processing using Data Lake Analytics, and we delve into U-SQL.
Meaning making – separating signal from noise. How do we transform the customer's next input into an action that creates a positive customer experience? We make the data more intelligent, so that it is able to guide our actions. The Data Lake builds on Big Data strengths by automating many of the manual development tasks, providing several self-service features to end-users, and an intelligent management layer to organize it all. This results in lower cost to create solutions, "smart" analytics, and faster time to business value.
Implementing a Data Lake with Enterprise Grade Data Governance - Hortonworks
Hadoop provides a powerful platform for data science and analytics, where data engineers and data scientists can leverage myriad data from external and internal data sources to uncover new insight. Such power also presents a few new challenges: on the one hand, the business wants more and more self-service, and on the other hand, IT is trying to keep up with the demand for data while maintaining architecture and data governance standards.
In this webinar, Andrew Ahn, Data Governance Initiative Product Manager at Hortonworks, will address the gaps and offer best practices in providing end-to-end data governance in HDP. Andrew Ahn will be followed by Oliver Claude of Waterline Data, who will share a case study of how Waterline Data Inventory works with HDP in the Modern Data Architecture to automate the discovery of business and compliance metadata, data lineage, as well as data quality metrics.
India’s recent stand on Smart City Development, and the involvement of various high income countries, initiates the discussion of ideal variables for smart city evolution by our own standards. With a vision of urban governance for general livability, it becomes imperative to study these parameters and ensure the evolution of our own concept of a Smart City. Our spatial planning models, based on unique factors such as human diversity, physical-social networks, the impact of ICT on the urban fabric, city resilience, etc., make it all the more interesting to evolve a blueprint for planning a Smart City.
The paper centers on the infrastructural developments for Smart Urban Development in India. The research helps us arrive at a general line of action for urban planning implications catering to the infrastructure sector, amongst others, thus affecting the environmental, social, and economic structure significantly. The study further finds scope for progress, encouraged by various government policies, for successful implementation of Smart City Development. It also allows a peek into a future scenario of improvements and deliberations particular to Indian standards, in consideration of the scenario in other countries.
Enterprise Data Lake: How to Conquer the Data Deluge and Derive Insights that Matter
Data can be traced from various consumer sources. Managing data is one of the most serious challenges faced by organizations today. Organizations are adopting the data lake model because lakes provide raw data that users can use for data experimentation and advanced analytics. A data lake can be a merging point of new and historic data, drawing correlations across all data using advanced analytics. A data lake can support self-service data practices, tapping undiscovered business value from new as well as existing data sources. Furthermore, a data lake can aid in modernizing data warehousing, analytics, and data integration. However, lakes also face hindrances such as immature governance, user skill gaps, and security.
Is the Traditional Data Warehouse Dead? - James Serra
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
Vikram Andem - Big Data Strategy @ IATA Technology Roadmap - IT Strategy Group
Vikram Andem, Senior Manager, United Airlines, A case for Bigdata Program and Strategy @ IATA Technology Roadmap 2014, October 13th, 2014, Montréal, Canada
The term "Data Lake" has become almost as overused and undescriptive as "Big Data". Many believe that centralizing datasets in HDFS makes a data lake, but then they struggle to realize any tangible value. This talk will redefine the "Data Lake" by describing four specific, key characteristics that we at Koverse have learned are crucial to successful enterprise data lake deployments. These characteristics are 1) indexing and search across all data sets, 2) interactive access for all users in the enterprise, 3) multi-level access control, and 4) integration with data science tools. These characteristics define a system that lets people realize value from their data versus getting lost in the hype. The talk will go on to provide a technical description of how we have integrated several projects, namely Apache Accumulo, Hadoop, and Spark, to implement an enterprise data lake with these key features.
Data Lakehouse, Data Mesh, and Data Fabric (r2) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Build the data lake, but avoid building the data swamp! The tool ecosystem is building up around the data lake, and soon many organizations will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence in their data platforms up.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Data Lakehouse, Data Mesh, and Data Fabric (r1) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
In this document, we will present a very brief introduction to Big Data (what is Big Data?), Hadoop (how does Hadoop fit the picture?), and Cloudera Hadoop (what is the difference between Cloudera Hadoop and regular Hadoop?).
Please note that this document is for Hadoop beginners looking for a place to start.
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC) - Denodo
Watch full webinar here: https://bit.ly/3aePFcF
Historically, data lakes have been created as a centralized physical data storage platform for data scientists to analyze data. But lately, the explosion of big data, data privacy rules, and departmental restrictions, among many other things, have made the centralized data repository approach less feasible. In this webinar, we will discuss why decentralized multipurpose data lakes are the future of data analysis for a broad range of business users.
Attend this session to learn:
- The restrictions of physical single purpose data lakes
- How to build a logical multi purpose data lake for business users
- The newer use cases that make multi purpose data lakes a necessity
Minimizing the Complexities of Machine Learning with Data Virtualization - Denodo
Watch full webinar here: https://buff.ly/309CZ1Y
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
*How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
*How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
*How you can use the Denodo Platform with large data volumes in an efficient way
*About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
Building a Logical Data Fabric using Data Virtualization (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/3FF1ubd
In the recent Building the Unified Data Warehouse and Data Lake report by leading industry analyst firm TDWI, 64% of organizations stated that the objective of a unified Data Warehouse and Data Lake is to get more business value, and 84% of organizations polled felt that a unified approach to Data Warehouses and Data Lakes was either extremely or moderately important.
In this session, you will learn how your organization can apply a logical data fabric, and how the associated technologies of machine learning, artificial intelligence, and data virtualization can reduce time to value, increasing the overall business value of your data assets.
KEY TAKEAWAYS:
- How a Logical Data Fabric is the right approach to assist organizations to unify their data.
- The advanced features of a Logical Data Fabric that assist with the democratization of data, providing an agile and governed approach to business analytics and data science.
- How a Logical Data Fabric with Data Virtualization enhances your legacy data integration landscape to simplify data access and encourage self-service.
Filip Panjevic is a Co-Founder and CTO at ydrive.ai, a startup dealing with self-driving cars, and one of the founders of the Petnica Machine Learning School.
Filip's talk will focus on the story of the Petnica School: how it started, what has changed since the beginning, what the concept of the school looks like now, and why that concept is good for making new data scientists. This talk will be perfect for people who are considering starting their careers in the data science field!
The talk will be a broad overview and thoughts about building one of the biggest data science communities in India. I will talk about how an ecosystem is created and value delivered to each stakeholder. I will be sharing my experience of building MachineHack and AIMinds and other platforms. One of the core agendas of the talk will be how these platforms have enabled a unique data science education and learning experience in India. The platforms built help students and engineers to imagine and work towards a career in data science.
In Drazen's talk, you will get a chance to hear how the Data Science Master 4.0 program at Belgrade University was created, and what the benefits of the program are.
PwC's recently released Responsible AI Diagnostic surveyed around 250 senior business executives from May to June 2019. The survey says that 84% of CEOs agree that AI-based decisions need to be explainable in order to be trusted. In the past few years, deep learning has shown remarkable results in various applications, which makes it one of the first choices for many AI use cases. However, deep learning models are hard to explain, and since the majority of CEOs expect AI solutions to be explainable, deep learning has a serious challenge. Daniel Kahneman, in his book Thinking, Fast and Slow, presented two different systems the human brain uses to form thoughts and decisions: System 1 is fast, intuitive, and hard to explain; System 2 is slow, conscious, and easy to explain. In this talk I will present: A) the PwC Responsible AI Survey, B) a proposed deep learning framework that mimics the two systems of thinking, and C) the recent advances in the neural-symbolic learning field.
Challenges in building a churn prediction model in different industries, presented by Jelena Pekez from Comtrade System Integration. The talk is focused on real-life use-case experience.
In my talk I am going to share with the audience practical experience of using BI solutions for steering bank credit portfolios, making data actionable, and communicating and collaborating on that data with relevant stakeholders. In our case, we have aimed for a solution that can use data models based in the cloud and on-premises, easily communicate and share information within the organization, and keep track of that information flow. In addition, we want our solution to support various datasets and to have the flexibility of integrating the most popular DS languages, R and Python, for the convenience and flexibility of our data science team. Our solution is based on Power BI plus the use of Azure Analysis Services and R.
The talk will have 3 parts: an overview of the practical applications of AI and ML in the FinTech industry, with a short explanation of the PSD2 directive and the disruption it caused; the application of AI/ML from the perspective of the end-user (personal financial health, financial coach, etc.); and an overview of the architecture, technologies, and frameworks used, with practical examples from the Zuper company.
We present a recommender system for personalized financial advice, which we designed for a large Swiss private bank. The final recommendations produced by the system were delivered to the end clients through a mobile banking platform. The recommender system is based on a collaborative filtering technique and can work with changing asset features, operate with implicit ratings and react to explicit feedback that clients can give using the mobile app. Moreover, we developed and implemented an approach to provide an explanation for each recommendation in the form “As you bought A, you might like B".
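The bank's actual system is not public, but the core idea - item-based collaborative filtering over implicit ratings, with an "as you bought A, you might like B" explanation - can be sketched in a few lines of numpy. The holdings matrix below is made up for illustration.

import numpy as np

# rows = clients, columns = assets; 1 = client holds the asset (implicit rating)
R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)
assets = ["A", "B", "C", "D"]

# cosine similarity between asset columns
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

def recommend(client, top_n=1):
    owned = np.flatnonzero(R[client])
    scores = sim[owned].sum(axis=0)            # aggregate similarity to holdings
    scores[owned] = -np.inf                    # never re-recommend a holding
    for j in np.argsort(scores)[::-1][:top_n]:
        because = assets[owned[np.argmax(sim[owned, j])]]
        print(f"As you bought {because}, you might like {assets[j]}")

recommend(client=0)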
This talk will focus on making real-time pipelines using cutting edge Big Data technologies and applying ML to the gathered data. The first part of the presentation will cover the importance of and necessity for streaming data processing, and propose tools that can be used to build a streaming pipeline. The second part will focus on making machine learning models for customer support, introducing success stories that cover the need for more efficient customer support, problem resolution, and the benefits gained.
Presentation of the first complete AI investment platform. It is based on the most innovative AI methods: advanced neural networks (ResNet/DenseNet, LSTM, GAN autoencoders) and reinforcement learning for risk control and position sizing using the AlphaZero approach. It shows how a complex AI system covering both supervised and reinforcement learning can be successfully used for investment portfolio optimization in real time. The architecture of the platform and the algorithms used will be presented together with the machine learning workflow. A live demo of the platform will also be shown.
A lot of companies make the mistake of thinking that just hiring Data Scientists will lead to increased revenue or increased profit. For a company’s investment in Data Science to be successful the Data Scientists need to work on the right problems, with the right people, and with the right tools. In this presentation, I will talk about the lessons I have learned, and mistakes made in applying Data Science in commercial settings over the last 10 years. I will highlight what processes can increase the chances of Data Science investment being successful.
The talk will focus on the reasons and methods for creating models which maximize the sales price gross margin while retaining high confidence that the quote will be accepted by the client. Price changes are dynamic and impacted by many different elements, like the cost of input material, labor cost, transportation cost, and scrap material due to different ordered quantities. Besides input cost segments, the output price is also impacted by different marketing campaigns (our own and others'), seasonality, and past and future customer behavior, as well as the behavior of the product we are selling.
Andjela will share the best practices that Things Solver brings when it comes to data monetization. Things Solver clients sell more customized offerings and end up with happier customers. Andjela will share the machine learning modules that do just that within Coeus, the Things Solver platform.
In the past few years, many businesses have started to understand the potential of real-time data analytics. And many of those invested time, energy, and finances to make it happen, with weaker outcomes than expected. There are a few reasons for this: too ambitious plans by leadership regarding leveraging data, not enough discipline in defining goals and an MVP for initial use cases, a plethora of tools and vendors who claim they can solve all the problems, etc. So, how can we get the most value, at reasonable cost, out of fast (real-time) data? We will try to answer this question and give actionable advice.
University of Nottingham Ningbo China. The advances of 5G, sensor, and information technologies have enabled the proliferation of smart pervasive sensor networks. Rapid progress in the design of biomedical sensors, advances in the management of medical knowledge, and improvements in algorithms for decision support are fueling a technological disruption of health monitoring. Current technologies enable personalized A3 (anyplace, anytime, anywhere) health monitoring. Continuous health monitoring extends health care into the home and workplace, changing the modes of traditional health care delivery. Medical grade systems require innovative solutions for system dependability, medical decision support, data management, and interpretation, beyond current fitness and wellbeing applications. We will present innovative solutions for A3 health monitoring and discuss the use of blockchain technologies and artificial intelligence, addressing technical, medical, and ethical requirements for personalized health monitoring systems.
Data Quality is essential for e-commerce and automating it can reduce a business’s daily bottlenecks and promote its competitiveness. Product similarity can help reduce duplicate content leveraging all types of product information. But dealing with mixed-type data such as product data is a rather untypical but real business case and can be challenging.
Uroš Valant has almost 20 years of experience in planning, managing, and delivering various IT projects. His best and richest experience is in the field of business analytics, project planning and implementation, database design, and the management of development teams. In recent years, his focus has been the field of predictive analytics, machine learning, and applying AI solutions to practical use in different fields of work.
In his talk he will present an interactive case study of the use of image recognition and AI-assisted design techniques in the textile industry.
The presentation will start as an engaging lecture where I will present the motivation behind the project, based on my academic research (my Oxford PhD among others). I will tell the audience just how rampant corruption is in local governance and why it is so persistent. Then I will present our remedy: full budget transparency. I will show them our search engine and how it works, and will call on the participants to download the APIs and play with the data themselves.
The talk will be divided into two parts. The first one is about geospatial open data and several Copernicus services where those data can be downloaded. The second one is about Forest and Climate project, as an example of geospatial analysis. The aim of the project was to identify the most suitable area for afforestation in Serbia by using satellite and Earth observation data. The results can be found at https://sumeiklima.org/.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
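As a small illustration of the first technique above - skipping computation on vertices that have already converged - here is a plain-Python sketch (not the STICD implementation). It assumes every vertex appears as a key in out_links and that the graph has no dead ends.

def pagerank_skip_converged(out_links, d=0.85, tol=1e-10, max_iter=100):
    N = len(out_links)
    # in_links[v] lists (u, out_degree(u)) for every edge u -> v
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append((u, len(outs)))
    rank = {v: 1.0 / N for v in out_links}
    active = set(out_links)                    # vertices still being updated
    for _ in range(max_iter):
        new_active = set()
        for v in active:
            r = (1 - d) / N + d * sum(rank[u] / deg for u, deg in in_links[v])
            if abs(r - rank[v]) > tol:
                # v changed, so its out-neighbours must be recomputed next round
                new_active.update(out_links[v])
            rank[v] = r
        if not new_active:                     # everything has converged
            break
        active = new_active
    return rank

# sanity check: a 3-cycle converges to equal ranks of 1/3
print(pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]}))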
Adjusting Primitives for Graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
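A minimal non-distributed sketch of the levelwise idea, using networkx for the strongly connected component condensation, might look like the following. It assumes the no-dead-end precondition stated above holds; upstream ranks are final by the time a component is iterated, so no per-iteration global pass is needed.

import networkx as nx

def levelwise_pagerank(G, d=0.85, tol=1e-10, max_iter=100):
    N = G.number_of_nodes()
    rank = {v: 1.0 / N for v in G}
    C = nx.condensation(G)                      # DAG of strongly connected components
    for scc_id in nx.topological_sort(C):       # process components level by level
        comp = C.nodes[scc_id]["members"]
        # Contributions from outside the component are already final.
        external = {
            v: sum(rank[u] / G.out_degree(u)
                   for u in G.predecessors(v) if u not in comp)
            for v in comp
        }
        for _ in range(max_iter):               # iterate only within the component
            prev = {v: rank[v] for v in comp}
            for v in comp:
                internal = sum(prev[u] / G.out_degree(u)
                               for u in G.predecessors(v) if u in comp)
                rank[v] = (1.0 - d) / N + d * (external[v] + internal)
            if sum(abs(rank[v] - prev[v]) for v in comp) < tol:
                break
    return rank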
Tabula.io Cheatsheet: automate your data workflows
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
1. Milos Milovanovic, Co-Founder & Data Engineer @ Things Solver
milos@thingsolver.com
milos@datascience.rs
Planning and Optimizing
Data Lake Architecture
2. Agenda
Introduction - Business Data Requirements
What is A Data Lake?
A Common Data Lake Architecture
When Problems Start To Show Up - Optimizing Data Lake
Expanding a Data Lake
How To Plan Data Lake - Success Factors
3. Introduction - Business Data Requirements
The main goal for organizations is to adapt and put all of their data to use.
It’s not an easy task - it might require mindset and structural changes.
Flexibility and agility are required for success.
Various trends and buzzwords are making it hard to stay on track.
Challenge of Transforming Enterprise Data Management - (“The data lake is a foundational component and common denominator of the modern data architecture, enabling and complementing specialized components, such as enterprise data warehouses, discovery-oriented environments, and highly-specialized analytic or operational data technologies…” - John O’Brien, CEO @ Radiant Advisors)
4. Data Lake - The Very First Definition
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
- James Dixon, CTO @ Pentaho
5. A More Formal Definition
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
7. Data Warehouse & Data Lake by Example
Social Media Streaming can be implemented using a traditional Data Warehouse
… but such an application will be too restricted and inflexible (e.g. when extending the number of columns analyzed).
Using a Data Lake for this purpose gives us the flexibility to adapt and test new metrics
… and we can easily add new applications on top.
9. A Common Data Lake Implementation Architecture
❏ In general, the architecture of a data lake is simple: a Hadoop File System (HDFS) with lots of directories and files on it.
❏ Hadoop is usually at the center of a Data Lake Architecture, although the concept is broader than Hadoop.
❏ Hadoop’s scalable, low-cost persistence layer and its ability to perform big data processing and analytics are a great toolset for achieving measurable business value at speed and low cost.
❏ Hive and Spark provide rich analytics on top of the data that is persisted at low cost.
10. This Architecture:
Acts like SQL
Efficient and Scalable
Connects to Basically Anything
Different Processing Modes (Realtime, Batch, Pipelines, Machine Learning, Ad Hoc Analysis …)
[Diagram: data sources feeding the Hadoop Distributed File System, with Hive and Spark on top]
11. When Problems Show Up
Hadoop + Spark/Hive != Database
- Searching for a row within TBs of data:
select * from my_table where some_column like '%123asd%';
- No updates and deletes
- Too many concurrent requests from BI tools
...
Spark Best Practice: http://go.databricks.com/not-your-fathers-database
12. How Do We Optimize Such a Solution?
❏ Use ORC File Format
❏ File Compaction (small files, deduplication)
❏ Run Spark on YARN
❏ Use Spark Dataframes
❏ Data Caching
❏ Use Traditional Databases
❏ Extend the Toolset (Solr, ES, Kafka, Redis, …)
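A hedged PySpark sketch of a few of the optimizations in the list above follows; paths, table, and column names are illustrative assumptions rather than details from the talk.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-optimizations")
         .enableHiveSupport()                   # lets DataFrames read/write Hive tables
         .getOrCreate())

raw = spark.read.json("hdfs:///lake/raw/events/")        # illustrative path

# Compact many small files into fewer, larger ones and deduplicate rows.
compacted = raw.dropDuplicates().coalesce(16)

# Persist in the ORC columnar format for fast scans from Hive and Spark.
compacted.write.mode("overwrite").format("orc").saveAsTable("lake.events")

# Cache a hot DataFrame so repeated ad hoc queries skip the rescan.
events = spark.table("lake.events").cache()
events.filter(events.event_type == "purchase").groupBy("country").count().show()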
13. Data Lake - Extended Toolset
[Diagram: HDFS surrounded by the extended toolset, and many more...]
14. [image-only slide]
15. How To Start With The Data Lake?
❏ Think of the Use Cases (don’t plan all the use cases - have some in mind)
❏ Master the Technology
❏ Go agile and flexible
❏ Do not forget about Data Governance, Data Quality, and Security (but do not drown in this)
❏ Integrate with BI and DWH
17. Milos Milovanovic, Co-Founder & Data Engineer @ Things Solver
milos@thingsolver.com
milos@datascience.rs
Planning and Optimizing
Data Lake Architecture