Cheetah is a custom data warehouse system built on top of Hadoop that provides high performance for storing and querying large datasets. It uses a virtual view abstraction over star and snowflake schemas to provide a simple yet powerful SQL-like query language. The system architecture utilizes MapReduce to parallelize query execution across many nodes. Cheetah employs columnar data storage and compression, multi-query optimization, and materialized views to improve query performance. Based on evaluations, Cheetah can efficiently handle both small and large queries and outperforms single-query execution when processing batches of queries together.
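To make the MapReduce angle concrete, here is a toy sketch (plain Python with made-up rows, not Cheetah's actual code) of how a SQL-style aggregate such as SELECT region, SUM(amount) ... GROUP BY region decomposes into map and reduce phases that could run on separate nodes:

```python
from collections import defaultdict

# Toy illustration only: how a GROUP BY aggregate splits into map and reduce phases.
# The rows and the "splits" that stand in for data on different nodes are made up.
rows = [("east", 10.0), ("west", 4.0), ("east", 6.5), ("north", 3.0), ("west", 1.5)]
splits = [rows[:3], rows[3:]]                 # pretend each split lives on its own node

def map_phase(split):
    # emit (key, value) pairs, one per input row
    return [(region, amount) for region, amount in split]

def reduce_phase(pairs):
    # sum the values for each key after the shuffle
    totals = defaultdict(float)
    for region, amount in pairs:
        totals[region] += amount
    return dict(totals)

mapped = [pair for split in splits for pair in map_phase(split)]   # shuffle: collect mapper output
print(reduce_phase(mapped))                   # {'east': 16.5, 'west': 5.5, 'north': 3.0}
```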
The document discusses whether a company needs a data lake. It describes the customer's current and desired data warehouse situation, including a daily delta load requirement and a need for streaming and external data. It then covers data warehouse architectures, big data technologies, and reference architectures that combine a data warehouse with a data lake. While a data lake provides benefits like flexibility, streaming, and accessing more data sources and years of data, it also introduces costs, complexity, and new skills requirements.
This document provides an overview of data warehousing concepts including:
- The key differences between operational systems and data warehouses in terms of design, usage, and data characteristics.
- The benefits of implementing a data warehouse for business intelligence and decision making.
- Common data warehousing architectures and approaches including top-down, bottom-up, and hybrid approaches.
- Fundamental data modeling techniques for data warehouses including entity-relationship modeling and dimensional modeling.
This document provides a sector roadmap for cloud analytic databases in 2017. It discusses key topics such as usage scenarios, disruption vectors, and an analysis of companies in the sector. Some main points:
- Cloud databases can now be considered the default option for most selections in 2017 due to economics and functionality.
- Several newer cloud-native offerings have been able to leapfrog more established databases through tight integration of cloud features like elasticity and separation of compute and storage.
- While traditional database functionality is still required, cloud dynamics are causing needs for capabilities like robust SQL support, diverse data support, and dynamic environment adaptation.
- Vendor solutions are evaluated on disruption vectors including SQL support, optimization, elasticity, environment
Balance agility and governance with #TrueDataOps and The Data Cloud - Kent Graziano
DataOps is the application of DevOps concepts to data. The DataOps Manifesto outlines WHAT that means, similar to how the Agile Manifesto outlines the goals of the Agile Software movement. But, as the demand for data governance has increased, and the demand to do “more with less” and be more agile has put more pressure on data teams, we all need more guidance on HOW to manage all this. Seeing that need, a small group of industry thought leaders and practitioners got together and created the #TrueDataOps philosophy to describe the best way to deliver DataOps by defining the core pillars that must underpin a successful approach. Combining this approach with an agile and governed platform like Snowflake’s Data Cloud allows organizations to indeed balance these seemingly competing goals while still delivering value at scale.
Given in Montreal on 14-Dec-2021
This document discusses building an integrated data warehouse with Oracle Database and Hadoop. It describes why a data warehouse may need Hadoop to handle big data from sources like social media, sensors and logs. Examples are given of using Hadoop for ETL and analytics. The presentation provides an overview of Hadoop and how to connect it to the data warehouse using tools like Sqoop and external tables. It also offers tips on getting started and avoiding common pitfalls.
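As a rough illustration of the Sqoop path mentioned above, here is a hedged sketch of launching an import from Oracle into HDFS from Python; the host, credentials file, table, and target directory are hypothetical placeholders:

```python
import subprocess

# Illustrative only: copy one Oracle table into HDFS with Sqoop so Hadoop can use it for ETL.
# Connection string, credentials file, table, and target directory are made-up placeholders.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//dwhost:1521/ORCL",
    "--username", "dw_reader",
    "--password-file", "/user/etl/.oracle_pw",   # keep secrets out of the command line
    "--table", "SALES.ORDERS",
    "--target-dir", "/data/staging/orders",
    "--num-mappers", "4",                        # parallel copy with 4 map tasks
], check=True)
```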
The document discusses Cassandra and how it is used by various companies for applications requiring scalability, high performance, and reliability. It summarizes Cassandra's capabilities and how companies like Netflix, Backupify, Ooyala, and Formspring have used Cassandra to handle large and increasing amounts of data and queries in a scalable and cost-effective manner. The document also describes DataStax's commercial offerings around Apache Cassandra including support, tools, and services.
This document summarizes a presentation by Kevin Kline on strategies for addressing common SQL Server challenges. The presentation covered topics such as tuning disk I/O, managing very large databases, and an overview of Quest software solutions for SQL Server monitoring and performance. Key points included strategies for tiered storage, partitioning very large databases, monitoring disk queue lengths and page reads/writes in SQL Server.
Delivering Data Democratization in the Cloud with Snowflake - Kent Graziano
This is a brief introduction to Snowflake Cloud Data Platform and our revolutionary architecture. It contains a discussion of some of our unique features along with some real world metrics from our global customer base.
Snowflake is a cloud-based data warehouse system that allows enterprises to store and analyze both structured and semi-structured data. It creates separate virtual warehouses for different workloads so they do not compete for computing resources and can easily scale up or down. Snowflake has grown exponentially since being founded in 2012, reaching a $3.5 billion valuation in October 2018. It sells data warehousing services using a pay-as-you-use business model.
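As a small illustration of the virtual-warehouse idea, here is a hedged sketch using the Python connector; the account, user, warehouse name, and sizing values are invented:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical connection details; in practice these come from your own account and secrets store.
conn = snowflake.connector.connect(account="xy12345", user="ANALYST", password="...")
cur = conn.cursor()

# A dedicated warehouse for one workload, so it never competes with other teams for compute.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS BI_WH
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60      -- suspend after 60 seconds idle to stop billing
      AUTO_RESUME = TRUE
""")

# Scale it up (or back down) on demand when the workload changes.
cur.execute("ALTER WAREHOUSE BI_WH SET WAREHOUSE_SIZE = 'LARGE'")
```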
Data mesh is a decentralized approach to managing and accessing analytical data at scale. It distributes responsibility for data pipelines and quality to domain experts. The key principles are domain-centric ownership, treating data as a product, and using a common self-service infrastructure platform. Snowflake is well-suited for implementing a data mesh with its capabilities for sharing data and functions securely across accounts and clouds, with built-in governance and a data marketplace for discovery. A data mesh implemented on Snowflake's data cloud can support truly global and multi-cloud data sharing and management according to data mesh principles.
This document provides an introduction and overview of implementing Data Vault 2.0 on Snowflake. It begins with an agenda and the presenter's background. It then discusses why customers are asking for Data Vault and provides an overview of the Data Vault methodology including its core components of hubs, links, and satellites. The document applies Snowflake features like separation of workloads and agile warehouse scaling to support Data Vault implementations. It also addresses modeling semi-structured data and building virtual information marts using views.
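For readers new to the hub/link/satellite vocabulary, here is a generic, engine-agnostic sketch of the table shapes and a "virtual information mart" view; it uses SQLite purely so the snippet is self-contained, and the keys and columns are invented rather than a Snowflake-specific or authoritative Data Vault 2.0 model:

```python
import sqlite3, hashlib, datetime

# Generic sketch of Data Vault shapes (hub, satellite, virtual mart view), shown with
# SQLite only so the example runs anywhere; on Snowflake the same DDL ideas apply,
# with hash keys typically computed in the ELT layer.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- hash of the business key
    customer_id   TEXT NOT NULL,      -- business key from the source
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
-- A 'virtual information mart': latest satellite row per hub key, exposed as a view.
CREATE VIEW dim_customer AS
SELECT h.customer_id, s.name, s.city
FROM hub_customer h
JOIN sat_customer_details s ON s.customer_hk = h.customer_hk
WHERE s.load_date = (SELECT MAX(load_date)
                     FROM sat_customer_details
                     WHERE customer_hk = h.customer_hk);
""")

def hk(business_key: str) -> str:
    return hashlib.md5(business_key.encode()).hexdigest()

now = datetime.datetime.utcnow().isoformat()
cur.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)", (hk("C-1001"), "C-1001", now, "CRM"))
cur.execute("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)", (hk("C-1001"), now, "Acme Ltd", "Odessa"))
print(cur.execute("SELECT * FROM dim_customer").fetchall())
```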
For those contemplating re-architecting or greenfield data lakes/data hubs/data warehouses in a cloud environment, talk to our Altis AWS Practice Lead - Guillaume Jaudouin - about why you should be considering the "tour de force" combination of AWS and Snowflake.
If you also got the Big Data itch, here is something to ease the pain :-)
Answers to these questions will be available soon (more info in the attached link)
Which Big Data Appliance should YOU use?
(click on the attached link for Poll results)
Appliances are Small and Quick, Right?
Revealing the 6 Types of Big Data Appliances
Uncovering the Main Players
Challenges, Pitfalls, and Winning the Big Data Game
Where is all this leading YOU to?
This talk will introduce you to the Data Cloud, how it works, and the problems it solves for companies across the globe and across industries. The Data Cloud is a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds. Snowflake's platform is the engine that powers and provides access to the Data Cloud.
Snowflake: The Good, the Bad, and the Ugly - Tyler Wishnoff
Learn how to solve the top 3 challenges Snowflake customers face, and what you can do to ensure high-performance, intelligent analytics at any scale. Ideal for those currently using Snowflake and those considering it. Learn more at: https://kyligence.io/
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc... - Kent Graziano
A good data model, done right the first time, can save you time and money. We have all seen the charts on the increasing cost of finding a mistake/bug/error late in a software development cycle. Would you like to reduce, or even eliminate, your risk of finding one of those errors late in the game? Of course you would! Who wouldn't? Nobody plans to miss a requirement or make a bad design decision (well nobody sane anyway). No data modeler or database designer worth their salt wants to leave a model incomplete or incorrect. So what can you do to minimize the risk?
In this talk I will show you a best practice approach to developing your data models and database designs that I have been using for over 15 years. It is a simple, repeatable process for reviewing your data models. It is one that even a non-modeler could follow. I will share my checklist of what to look for and what to ask the data modeler (or yourself) to make sure you get the best possible data model. As a bonus I will share how I use SQL Developer Data Modeler (a no-cost data modeling tool) to collect the information and report it.
Chug: Building a data lake in Azure with Spark and Databricks - Brandon Berlinrut
- The document discusses building a data lake in Azure using Spark and Databricks. It begins with an introduction of the presenter and their experience.
- The rest of the document is organized into sections that discuss decisions around why to use a data lake and Azure/Databricks, how to build the lake by ingesting and organizing data, using Delta Lake for integrated and curated layers, securing the lake, and enabling analytics against the lake.
- The key aspects covered include getting data into the lake from various sources using custom Spark jobs, organizing the lake into layers, cataloging data, using Delta Lake for transactional tables, implementing role-based security, and allowing ad-hoc queries.
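A minimal PySpark sketch of that raw-to-curated flow (the storage paths are made up, and the cluster is assumed to have the Delta Lake libraries available, as Databricks runtimes do):

```python
from pyspark.sql import SparkSession

# Sketch of a raw -> curated flow; paths and field names are hypothetical.
spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Raw layer: land source files as-is.
raw = spark.read.json("abfss://raw@mylake.dfs.core.windows.net/orders/2024-01-01/")

# Curated layer: write a transactional Delta table that downstream users query.
(raw.selectExpr("order_id", "customer_id", "cast(amount as double) as amount")
    .write.format("delta")
    .mode("append")
    .save("abfss://curated@mylake.dfs.core.windows.net/orders/"))

# Ad-hoc analytics against the curated layer.
spark.read.format("delta").load(
    "abfss://curated@mylake.dfs.core.windows.net/orders/"
).groupBy("customer_id").sum("amount").show()
```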
Building the Enterprise Data Lake - Important Considerations Before You Jump In - SnapLogic
This document discusses considerations for building an enterprise data lake. It begins by introducing the presenters and stating that the session will not focus on SQL. It then discusses how the traditional "crab" model of data delivery does not scale and how organizations have shifted to industrialized data publishing. The rest of the document discusses important aspects of data lake architecture, including how different types of data like sensor data require new approaches. It emphasizes that the data lake requires a distributed service architecture rather than a monolithic structure. It also stresses that the data lake consists of three core subsystems for acquisition, management, and access, and that these depend on underlying platform services.
Demystifying Data Warehousing as a Service - DFW - Kent Graziano
This document provides an overview and introduction to Snowflake's cloud data warehousing capabilities. It begins with the speaker's background and credentials. It then discusses common data challenges organizations face today around data silos, inflexibility, and complexity. The document defines what a cloud data warehouse as a service (DWaaS) is and explains how it can help address these challenges. It provides an agenda for the topics to be covered, including features of Snowflake's cloud DWaaS and how it enables use cases like data mart consolidation and integrated data analytics. The document highlights key aspects of Snowflake's architecture and technology.
How to Take Advantage of an Enterprise Data Warehouse in the Cloud - Denodo
Watch full webinar here: [https://buff.ly/2CIOtys]
As organizations collect increasing amounts of diverse data, integrating that data for analytics becomes more difficult. Technology that scales poorly and lacks support for semi-structured data cannot meet the ever-increasing demands of today's enterprise. In short, companies everywhere can't consolidate their data into a single location for analytics.
In this Denodo DataFest 2018 session we’ll cover:
Bypassing the mandate of a single enterprise data warehouse
Modern data sharing to easily connect different data types located in multiple repositories for deeper analytics
How cloud data warehouses can scale both storage and compute, independently and elastically, to meet variable workloads
Presentation by Harsha Kapre, Snowflake
Enabling a Data Mesh Architecture with Data Virtualization - Denodo
Watch full webinar here: https://bit.ly/3rwWhyv
The Data Mesh architectural design was first proposed in 2019 by Zhamak Dehghani, principal technology consultant at Thoughtworks, a technology company that is closely associated with the development of distributed agile methodology. A data mesh is a distributed, de-centralized data infrastructure in which multiple autonomous domains manage and expose their own data, called “data products,” to the rest of the organization.
Organizations leverage data mesh architecture when they experience shortcomings in highly centralized architectures, such as the lack of domain-specific expertise in data teams, the inflexibility of centralized data repositories in meeting the specific needs of different departments within large organizations, and the slowness of centralized data infrastructures in provisioning data and responding to changes.
In this session, Pablo Alvarez, Global Director of Product Management at Denodo, explains how data virtualization is your best bet for implementing an effective data mesh architecture.
You will learn:
- How data mesh architecture not only enables better performance and agility, but also self-service data access
- The requirements for “data products” in the data mesh world, and how data virtualization supports them
- How data virtualization enables domains in a data mesh to be truly autonomous
- Why a data lake is not automatically a data mesh
- How to implement a simple, functional data mesh architecture using data virtualization
This presentation explains the Integrator's Dilemma and how the SnapLogic Integration Cloud can help.
To learn more, visit: http://www.snaplogic.com/.
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit... - Amazon Web Services
Snowflake is a cloud-based data warehouse that is built for the cloud. It was founded in 2012 and has raised $1 billion in funding. Snowflake's architecture separates storage, compute, and metadata services, allowing it to offer unlimited scalability, multiple clusters that can access shared data with no downtime, and full transactional consistency across the system. Snowflake has over 2000 customers including large enterprises that use it for analytics, data science, and sharing large volumes of data securely.
(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014 - Amazon Web Services
This document discusses a platform called EzBake that was created to help a US government customer modernize their systems and better analyze large amounts of data. EzBake provides tools to easily develop and deploy applications, integrate and analyze data from various sources, and implement security controls. It improved the customer's ability to share data and applications across many teams and networks, decreased development times from 6-8 months to 3-4 weeks, and reduced costs while increasing capabilities.
Better Together: The New Data Management Orchestra - Cloudera, Inc.
To ingest, store, process and leverage big data for maximum business impact requires integrating systems, processing frameworks, and analytic deployment options. Learn how Cloudera’s enterprise data hub framework, MongoDB, and Teradata Data Warehouse working in concert can enable companies to explore data in new ways and solve problems that not long ago might have seemed impossible.
Gone are the days of NoSQL and SQL competing for center stage. Visionary companies are driving data subsystems to operate in harmony. So what’s changed?
In this webinar, you will hear from executives at Cloudera, Teradata and MongoDB about the following:
How to deploy the right mix of tools and technology to become a data-driven organization
Examples of three major data management systems working together
Real world examples of how business and IT are benefiting from the sum of the parts
Join industry leaders Charles Zedlewski, Chris Twogood and Kelly Stirman for this unique panel discussion, moderated by BI Research analyst, Colin White.
Introducing Snowflake, an elastic data warehouse delivered as a service in the cloud. It aims to simplify data warehousing by removing the need for customers to manage infrastructure, scaling, and tuning. Snowflake uses a multi-cluster architecture to provide elastic scaling of storage, compute, and concurrency. It can bring together structured and semi-structured data for analysis without requiring data transformation. Customers have seen significant improvements in performance, cost savings, and the ability to add new workloads compared to traditional on-premises data warehousing solutions.
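As a hedged sketch of the "semi-structured data without transformation" point, the snippet below loads a JSON document into a VARIANT column and queries it with path notation; the table and field names are invented:

```python
import snowflake.connector

# Sketch of querying semi-structured JSON in Snowflake without pre-transformation;
# account, table, and column names are hypothetical.
conn = snowflake.connector.connect(account="xy12345", user="ANALYST", password="...")
cur = conn.cursor()

cur.execute("CREATE TEMPORARY TABLE events (payload VARIANT)")
cur.execute("""INSERT INTO events SELECT PARSE_JSON('{"user_id": "u1", "action": "click", "ms": 42}')""")

# Path notation reads the JSON directly; no ETL into fixed columns is required.
cur.execute("""
    SELECT payload:user_id::string AS user_id,
           payload:action::string  AS action,
           payload:ms::number      AS latency_ms
    FROM events
""")
print(cur.fetchall())
```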
Building a Logical Data Fabric using Data Virtualization (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/3FF1ubd
In the recent Building the Unified Data Warehouse and Data Lake report by leading industry analyst firm TDWI, 64% of organizations stated that the objective of a unified data warehouse and data lake is to get more business value, and 84% of organizations polled felt that a unified approach to data warehouses and data lakes was either extremely or moderately important.
In this session, you will learn how your organization can apply a logical data fabric, and how the associated technologies of machine learning, artificial intelligence, and data virtualization can reduce time to value, increasing the overall business value of your data assets.
KEY TAKEAWAYS:
- How a Logical Data Fabric is the right approach to assist organizations to unify their data.
- The advanced features of a Logical Data Fabric that assist with the democratization of data, providing an agile and governed approach to business analytics and data science.
- How a Logical Data Fabric with Data Virtualization enhances your legacy data integration landscape to simplify data access and encourage self-service.
Social Media Art is art inspired by and embodied in the newest digital forms (Facebook, Twitter, Wikipedia) - from meditative net art to outrageous social performance. You will learn about pioneering artists who use likes and reposts instead of paint and canvas.
DataTalks #4: Building a data warehouse on the Hadoop platform / Igor ... - WG_ Events
In this talk Igor will demonstrate a warehouse architecture based on technologies from Cloudera and Oracle. He will share his experience integrating many data sources using both home-grown solutions and specialized tools such as Apache NiFi. The talk will be of interest to technical specialists who are already familiar with the Hadoop stack.
Interested in data analysis? Join our group on Facebook: https://www.facebook.com/groups/DataTalks/
Moving along a fragile bottom / Sergey Karatkevich (servers.ru) - Ontico
Today the Internet is enamored with microservices, containers, and immutable infrastructure. It is very hard to resist the temptation to introduce something like this at the company where you work right now. I will try to talk you out of using these technologies to the detriment of the application, yourself, and the company's business as a whole. I will describe a typical project that was launched in 20 countries over 4 months, the problems I ran into, and the conclusions I drew.
- Why microservices will not save your project but bury it.
Drawing on my own experience, I will explain why you should not get carried away with microservices for small projects, and why good intentions - simpler and more frequent deploys, higher availability, and better scaling - lead to a loss of flexibility and a critical drop in system stability.
- Why your system is too complex for its tasks.
I will explain why you should not overcomplicate your system, why your system is most likely too complex for the tasks it solves, and why you do not control what is happening inside it. I will explain why you will spend all your time debugging a complex system instead of solving business problems.
- Why Docker is used incorrectly.
I will give real examples of using Docker for a new project and for a ported project, explain on live examples what problems operators run into when working with Docker, explain why you are most likely using Docker incorrectly, and suggest ways to avoid this.
- Why immutable is too static for your company.
I will talk about my experience with immutable infrastructure and explain why, in my opinion, the transition to such an infrastructure
Data Lake vs. Data Warehouse: Which is Right for Healthcare? - Health Catalyst
The data lake style of a data warehouse architecture is a flexible alternative to a traditional data warehouse. It allows for unstructured data. When a warehousing approach requires that the data be in a structured format, there are constraints on the analyses that can be performed because not all of the data can be structured early. The data lake concept is very similar to our Late-Binding approach in that data lakes are our source marts. We increase the efficiency and effectiveness of these through: 1. Metadata, 2. Source Mart Designer, and 3. Subject Area Mart Designer.
Vladimir Slobodyanyuk, "DWH & BigData – architecture approaches" - Anna Shymchenko
This document discusses approaches to data warehouse (DWH) and big data architectures. It begins with an overview of big data, describing its large size and complexity that makes it difficult to process with traditional databases. It then compares Hadoop and relational database management systems (RDBMS), noting pros and cons of each for distributed computing. The document outlines how Hadoop uses MapReduce and has a structure including HDFS, HBase, Hive and Pig. Finally, it proposes using Hadoop as an ETL and data quality tool to improve traceability, reduce costs and handle exception data cleansing more effectively.
This document provides an overview of big data fundamentals and considerations for setting up a big data practice. It discusses key big data concepts like the four V's of big data. It also outlines common big data questions around business context, architecture, skills, and presents sample reference architectures. The document recommends starting a big data practice by identifying use cases, gaining management commitment, and setting up a center of excellence. It provides an example use case of retail web log analysis and presents big data architecture patterns.
The document discusses the modern data warehouse and key trends driving changes from traditional data warehouses. It describes how modern data warehouses incorporate Hadoop, traditional data warehouses, and other data stores from multiple locations including cloud, mobile, sensors and IoT. Modern data warehouses use massively parallel processing (MPP) architecture and the Apache Hadoop ecosystem including the Hadoop Distributed File System, YARN, Hive, Spark and other tools. It also discusses the top Hadoop vendors and Oracle's technical innovations on Hadoop for data discovery, transformation, and sharing. Finally, it covers the components of big data value assessment including descriptive, predictive and prescriptive analytics.
The document discusses the modern data warehouse and key trends driving changes from traditional data warehouses. It describes how modern data warehouses incorporate Hadoop, traditional data warehouses, and other data stores from multiple locations including cloud, mobile, sensors and IoT. Modern data warehouses use massively parallel processing (MPP) architecture for distributed computing and scale-out. The Hadoop ecosystem, including components like HDFS, YARN, Hive, Spark and Zookeeper, provides functionality for storage, processing, and analytics. Major vendors like Oracle provide technical innovations on Hadoop for data discovery, exploration, transformation, and sharing capabilities. The document concludes with an overview of descriptive, predictive and prescriptive analytics capabilities in a big data value assessment.
Asterix Solution's Hadoop Training is designed to help applications scale up from single servers to thousands of machines. Although memory and storage costs have fallen sharply, data processing speeds have not increased at the same rate, so loading very large datasets remains a big headache; Hadoop is the solution for it.
http://www.asterixsolution.com/big-data-hadoop-training-in-mumbai.html
Duration - 25 hrs
Session - 2 per week
Live Case Studies - 6
Students - 16 per batch
Venue - Thane
This document discusses how Apache Hadoop provides a solution for enterprises facing challenges from the massive growth of data. It describes how Hadoop can integrate with existing enterprise data systems like data warehouses to form a modern data architecture. Specifically, Hadoop provides lower costs for data storage, optimization of data warehouse workloads by offloading ETL tasks, and new opportunities for analytics through schema-on-read and multi-use data processing. The document outlines the core capabilities of Hadoop and how it has expanded to meet enterprise requirements for data management, access, governance, integration and security.
The document discusses using Attunity Replicate to accelerate loading and integrating big data into Microsoft's Analytics Platform System (APS). Attunity Replicate provides real-time change data capture and high-performance data loading from various sources into APS. It offers a simplified and automated process for getting data into APS to enable analytics and business intelligence. Case studies are presented showing how major companies have used APS and Attunity Replicate to improve analytics and gain business insights from their data.
This document discusses big data analytics techniques like Hadoop MapReduce and NoSQL databases. It begins with an introduction to big data and how the exponential growth of data presents challenges that conventional databases can't handle. It then describes Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers using a simple programming model. Key aspects of Hadoop covered include MapReduce, HDFS, and various other related projects like Pig, Hive, HBase etc. The document concludes with details about how Hadoop MapReduce works, including its master-slave architecture and how it provides fault tolerance.
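To make the MapReduce contract concrete, here is a small Hadoop Streaming-style word count; on a real cluster the mapper and reducer would be separate scripts fed via stdin/stdout by the framework, while this self-contained sketch simulates the sort/shuffle step in memory:

```python
from itertools import groupby

# Hadoop Streaming style word count. On a cluster the mapper and reducer run as separate
# processes reading stdin and writing stdout, with the framework sorting between them;
# here both roles plus the shuffle are simulated in one process for illustration.

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word + "\t1"                      # emit (word, 1)

def reducer(sorted_pairs):
    for word, group in groupby(sorted_pairs, key=lambda kv: kv.split("\t")[0]):
        total = sum(int(kv.split("\t")[1]) for kv in group)
        yield word + "\t" + str(total)              # emit (word, count)

if __name__ == "__main__":
    text = ["big data needs hadoop", "hadoop stores big data"]
    shuffled = sorted(mapper(text))                 # stand-in for the framework's sort/shuffle
    print("\n".join(reducer(shuffled)))
```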
This document provides an overview of HDInsight and Hadoop. It defines big data and Hadoop, describing HDInsight as Microsoft's implementation of Hadoop in the cloud. It outlines the Hadoop ecosystem including HDFS, MapReduce, YARN, Hive, Pig and Sqoop. It discusses advantages of using HDInsight in the cloud and provides information on working with HDInsight clusters, loading and querying data, and different approaches to big data solutions.
Infrastructure Considerations for Analytical WorkloadsCognizant
Using Apache Hadoop clusters and Mahout for analyzing big data workloads yields extraordinary performance; we offer a detailed comparison of running Hadoop in a physical vs. virtual infrastructure environment.
Josh Patterson gave a presentation on Hadoop and how it has been used. He discussed his background working on Hadoop projects including for the Tennessee Valley Authority. He outlined what Hadoop is, how it works, and examples of use cases. This includes how Hadoop was used to store and analyze large amounts of smart grid sensor data for the openPDC project. He discussed integrating Hadoop with existing enterprise systems and tools for working with Hadoop like Pig and Hive.
A short overview of big data, its popularity, and its ups and downs from past to present. It also looks at its needs, challenges, and risks, the architectures involved, and the vendors associated with it.
The document discusses big data and Hadoop. It describes the three V's of big data - variety, volume, and velocity. It also discusses Hadoop components like HDFS, MapReduce, Pig, Hive, and YARN. Hadoop is a framework for storing and processing large datasets in a distributed computing environment. It allows for the ability to store and use all types of data at scale using commodity hardware.
Hadoop and the Data Warehouse: Point/Counter Point - Inside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
The document provides an introduction to big data and Hadoop. It defines big data as large datasets that cannot be processed using traditional computing techniques due to the volume, variety, velocity, and other characteristics of the data. It discusses traditional data processing versus big data and introduces Hadoop as an open-source framework for storing, processing, and analyzing large datasets in a distributed environment. The document outlines the key components of Hadoop including HDFS, MapReduce, YARN, and Hadoop distributions from vendors like Cloudera and Hortonworks.
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook - Amr Awadallah
Hadoop was developed to solve problems with data warehousing systems at Yahoo and Facebook that were limited in processing large amounts of raw data in real-time. Hadoop uses HDFS for scalable storage and MapReduce for distributed processing. It allows for agile access to raw data at scale for ad-hoc queries, data mining and analytics without being constrained by traditional database schemas. Hadoop has been widely adopted for large-scale data processing and analytics across many companies.
Deutsche Telekom and T-Systems are large European telecommunications companies. Deutsche Telekom has revenue of $75 billion and over 230,000 employees, while T-Systems has revenue of $13 billion and over 52,000 employees providing data center, networking, and systems integration services. Hadoop is an open source platform that provides more cost effective storage, processing, and analysis of large amounts of structured and unstructured data compared to traditional data warehouse solutions. Hadoop can help companies gain value from all their data by allowing them to ask bigger questions.
This document outlines the modules and topics covered in an Edureka course on Hadoop. The 10 modules cover understanding Big Data and Hadoop architecture, Hadoop cluster configuration, MapReduce framework, Pig, Hive, HBase, Hadoop 2.0 features, and Apache Oozie. Interactive questions are also included to test understanding of concepts like Hadoop core components, HDFS architecture, and MapReduce job execution.
DataScience Lab, May 13, 2017
Correcting geometric distortions in optical satellite imagery
Alexey Kravchenko (Senior Data Scientist at Zoral Labs)
We will survey the variety of available satellite data and the ways it is used in agriculture, forestry, and land-cover mapping. We will then focus on geometric correction of imagery as the first step of the satellite data processing pipeline, including georeferencing, image registration, subpixel identification of ground control points, and band co-registration. We will also cover some interesting and unexpected approaches to estimating satellite orientation and jitter and to building cloud masks.
All materials: http://datascience.in.ua/report2017
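As one tiny building block of the registration step described above, here is a hedged sketch that estimates a pure translation between two bands with OpenCV's phase correlation; the images are synthetic, and real geometric correction additionally handles rotation, scale, sensor models, and subpixel ground control points:

```python
import cv2
import numpy as np

# Synthetic example: create a reference band and a copy shifted by a few pixels,
# then recover the offset with phase correlation. Real pipelines work on actual
# satellite bands and refine matches to subpixel accuracy around control points.
ref = np.random.rand(256, 256).astype(np.float32)
shifted = np.roll(ref, shift=(3, 5), axis=(0, 1))     # simulate a known offset

(dx, dy), response = cv2.phaseCorrelate(ref, shifted)
print(f"estimated shift: dx={dx:.2f}, dy={dy:.2f}, peak response={response:.3f}")
```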
DataScience Lab 2017_Kappa Architecture: How to implement a real-time streami... - GeeksLab Odessa
DataScience Lab, May 13, 2017
Kappa Architecture: How to implement a real-time streaming data analytics engine
Juantomás García (Data Solutions Manager at OpenSistemas, Madrid, Spain)
We will start with an introduction to the kappa architecture versus the lambda architecture. We will see how the kappa architecture is a good way to build (near) real-time solutions when we need to analyze streaming data. We will walk through a real use case: how the architecture is designed, how the pipelines are organized, and how data scientists use it. We will review the most widely used technologies for implementing it, from Apache Kafka + Spark using Scala to newer tools like Apache Beam / Google Dataflow.
All materials: http://datascience.in.ua/report2017
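A hedged sketch of such a kappa-style pipeline, shown here with PySpark Structured Streaming rather than Scala; the broker address, topic, and checkpoint path are invented, and the spark-sql-kafka connector is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, count

# Kappa-style sketch: events arrive on an append-only Kafka log and are processed as a stream.
spark = SparkSession.builder.appName("kappa-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "clickstream")                  # hypothetical topic
          .load())

counts = (events
          .selectExpr("CAST(value AS STRING) AS page", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"), col("page"))
          .agg(count("*").alias("views")))

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/kappa-checkpoint")
         .start())
query.awaitTermination()
```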
Semgrex allows users to extract information from text using patterns that match syntactic dependencies in sentences. It provides examples of patterns that match part-of-speech tags and dependency relations. The document also includes links to the Semgrex npm package, a demo application on GitHub, and resources for natural language processing and syntactic dependencies.
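For a rough idea of what a Semgrex pattern looks like in practice, here is a hedged sketch that runs a pattern through Stanford CoreNLP via the stanza client rather than the npm package mentioned above; it assumes a local CoreNLP installation the client can launch, and the sentence and pattern are invented:

```python
from stanza.server import CoreNLPClient

# A pattern that matches a verb governing a nominal subject via the nsubj relation.
# Node attributes like {tag:/VB.*/} match part-of-speech tags; named nodes
# (=verb, =subject) are returned with each match.
pattern = "{tag:/VB.*/}=verb >nsubj {}=subject"

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "depparse"], be_quiet=True) as client:
    matches = client.semgrex("The warehouse loads data nightly.", pattern)
    print(matches)
```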
DataScience Lab 2017_An overview of face detection methods - GeeksLab Odessa
DataScience Lab, May 13, 2017
An overview of face detection methods in images
Yuriy Pashchenko (Research Engineer, Ring Labs)
In this talk we give an overview of the newest and most popular face detection methods, such as Viola-Jones, Faster-RCNN, MTCNN, and others. We will discuss the main criteria for evaluating algorithm quality, as well as benchmark datasets including FDDB, WIDER, and IJB-A.
All materials: http://datascience.in.ua/report2017
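For contrast with the CNN detectors discussed in the talk, here is a minimal classical baseline (Viola-Jones via OpenCV's bundled Haar cascade); the input image path is hypothetical:

```python
import cv2

# Load OpenCV's pretrained frontal-face Haar cascade (classic Viola-Jones detector).
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group_photo.jpg")                      # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect faces at multiple scales and draw a box around each one.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("group_photo_faces.jpg", img)
print(f"detected {len(faces)} face(s)")
```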
DataScienceLab2017_Patient similarity: cleaning up duplicates and predicting mis... - GeeksLab Odessa
DataScience Lab, May 13, 2017
Patient similarity: cleaning up duplicates and predicting missing diagnoses
Viktor Sarapin (CEO at V.I.Tech)
How to efficiently detect duplicates across tens of millions of patients, and how to identify missing diagnoses and treatment actions.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
Recent deep learning approaches for speech generation
Dmitry Belevtsov (Techlead at IBDI)
Over the last six months several important deep-neural-network models have appeared that can successfully synthesize human speech at the level of individual samples. This made it possible to get around many shortcomings of classical spectral approaches. In this talk I will give a brief overview of the architectures of the most popular networks, such as WaveNet and SampleRNN.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
Distributed computing: using BOINC in Data Science
Vitalii Koshura (Software Developer at Lohika)
BOINC is open-source software for distributed computing. This talk covers the use of BOINC in various fields of science that involve processing huge volumes of data, using currently active research projects as examples.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
The "Data Science" master's program at UCU
Orest Kupyn (Master's Student at UCU)
In this talk I will tell you about the master's program with a specialization in data analysis at the Ukrainian Catholic University. I will cover the structure of the program and its core courses, describe my own experience as a UCU student, and talk about the challenges we faced this year.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_Serving models built on big data with A... - GeeksLab Odessa
DataScience Lab, May 13, 2017
Serving models built on big data with Apache Spark
Stepan Pushkarev (GM (Kazan) at Provectus / CTO at Hydrosphere.io)
After the data has been prepared and the models trained on big data with Apache Spark, the question arises of how to use the trained models in real applications. Besides the model itself, it is important not to forget the entire data pre-processing pipeline, which has to reach production in exactly the form the data scientist designed and implemented. Solutions such as PMML/PFA, based on exporting/importing the model and algorithm, have obvious drawbacks and limitations. In this talk we propose an alternative solution that simplifies using models and pipelines in real production applications.
All materials are available at: http://datascience.in.ua/report2017
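One common way to keep preprocessing and model together is Spark ML pipeline persistence, sketched below; the training rows, feature settings, and save path are invented, and the talk's own Hydrosphere-based approach goes further than this:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Persist the whole preprocessing + model pipeline so production scoring
# sees exactly what the data scientist built. Data and paths are made up.
spark = SparkSession.builder.appName("pipeline-persistence").getOrCreate()
train = spark.createDataFrame(
    [("spark makes etl easy", 1.0), ("the meeting is tomorrow", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 10),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)
model.write().overwrite().save("/tmp/text_pipeline")          # ship this artifact

# In the serving application: load and score without re-implementing preprocessing.
served = PipelineModel.load("/tmp/text_pipeline")
served.transform(spark.createDataFrame([("spark pipelines in production",)], ["text"])).show()
```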
DataScienceLab2017_BioVec: Word2Vec for genomic data analysis and bioin... - GeeksLab Odessa
DataScience Lab, May 13, 2017
BioVec: Word2Vec for genomic data analysis and bioinformatics
Dmitry Novitsky (Senior Researcher at IPMMS NANU)
This talk is devoted to bioVec: applying word2vec to bioinformatics problems. First we recap how Word2vec and similar word-embedding methods work. Then we discuss the specifics of applying Word2vec to genomic sequences, the main kind of data in bioinformatics: how to train bioVec and how to apply it to tasks such as protein classification, protein function prediction, and more. Finally, we demonstrate code examples for training and using bioVec.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_Data Science and Big Data in Telecom_Александр Саенко (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Data Science and Big Data in Telecom
Александр Саенко (Software Engineer at SoftServe/CISCO)
The talk covers several interesting examples of using Big Data and Data Science in telecom: cellular network optimization, customer experience improvement, models for predicting mobile device location, churn prevention, fraud detection, and others, along with the main modern approaches to solving these problems with machine learning algorithms.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_High-performance computing capabilities for data analysis systems (GeeksLab Odessa)
DataScience Lab, May 13, 2017
High-performance computing capabilities for data analysis systems
Михаил Федосеев (Infrastructure Solutions Architect, LanTec)
This talk discusses the hardware side of data analysis systems for building private clouds or local high-performance computing clusters. We look at which technologies and integrated solutions from Hewlett Packard Enterprise can accelerate data analysis: not only the proven, best-in-segment HPE Apollo server line and high-speed HPE network switches, but also supporting elements of the solution such as powerful NVIDIA graphics cards and Xeon Phi host processors. We also cover the HPE Core HPC Software Stack, which lets administrators control how system resources are used.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab 2017_Monitoring fashion trends with deep learning and TensorFlow (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Monitoring fashion trends with deep learning and TensorFlow, Ольга Романюк (Data Scientist at Eleks)
Over the last 8 months at Eleks we have been working on a fashion trend tracking system based on a deep residual neural network with identity mappings. During training we used online data augmentation as well as data parallelism across two GPUs. We built this system from scratch with TensorFlow. In this presentation I cover the practical side of the project, implementation nuances, and the pitfalls we ran into along the way.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_Who's there? Automatic speaker diarization of phone calls (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Who's there? Automatic speaker diarization of phone calls
Юрий Гуц (Machine Learning Engineer, DataRobot)
Automatic speaker diarization is an interesting problem in multimedia data processing: we need to answer the question "Who speaks when?" without knowing anything about the number or identity of the speakers in the recording. In this talk we review methods that work for speaker diarization of phone conversations.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (GeeksLab Odessa)
From bag of texts to bag of clusters
Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)
We review modern approaches to text clustering and visualization, from classic K-means on TF-IDF to deep learning text representations. As a practical example, we analyze a set of social media posts and try to find the main discussion topics.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_Probabilistic graphical models for decision-making in project management (GeeksLab Odessa)
Probabilistic graphical models for decision-making in project management
Ольга Татаринцева (Data Scientist at Eleks)
How often do you have to make decisions using knowledge of a particular domain? How good are those decisions? Now imagine that you have gathered the knowledge of the best experts in that domain: decisions based on it should be far more well-founded. We talk about ProjectHealth, a system built on the experience of the best project management experts at Eleks. It uses a probabilistic graphical model, namely a Bayesian network implemented in Python. Over the course of the project we went from requirements elicitation, data discovery, and building the model from scratch to a BI dashboard with drill-down all the way to the raw data. ProjectHealth now saves a great deal of time for top management and company resources, because it monitors the state of the business in fine detail every day and does so like a real expert.
All materials: http://datascience.in.ua/report2017
DataScienceLab2017_Hyperparameter optimization for machine learning with Bayesian optimization (GeeksLab Odessa)
DataScienceLab, May 13, 2017
Hyperparameter optimization for machine learning with Bayesian optimization
Максим Бевза (Research Engineer at Grammarly)
All machine learning algorithms need tuning. We often use Grid Search, Randomized Search, or our intuition to pick hyperparameters. Bayesian optimization helps steer Randomized Search toward the most promising regions, so we get the same (or better) result in fewer iterations.
All materials: http://datascience.in.ua/report2017
DataScienceLab2017_How to know everything (or almost everything) about your customers?_Дарина Перемот (GeeksLab Odessa)
DataScienceLab, May 13, 2017
How to know everything (or almost everything) about your customers?
Дарина Перемот (ML Engineer at SynergyOne)
We share our own answer to the question "What does the customer want?", present the results of our transaction research, tell you whether you have a pet, and demonstrate how machine learning is already helping us get to know you better.
All materials: http://datascience.in.ua/report2017
JS Lab 2017_Mapbox GL: how modern interactive maps work_Владимир Агафонкин (GeeksLab Odessa)
JS Lab 2017, March 25
Mapbox GL: how modern interactive maps work
Владимир Агафонкин (Lead JavaScript Engineer at MapBox)
Mapbox GL JS is an open-source JS library for building modern interactive maps on top of WebGL. In development for more than three years, it combines many remarkable technologies, complex algorithms, and ideas to achieve smooth rendering of thousands of vector features with millions of points in real time. In this talk you will learn how the library works internally and what difficulties developers of modern WebGL applications face. The talk covers: font rendering, line and polygon triangulation, spatial indexes, collision detection, label placement, point clustering, shape clipping, line simplification, sprite packing, compact binary formats, parallel data processing in the browser, render testing, and other challenges.
All materials: http://jslab.in.ua/2017
JS Lab2017_Under the microscope: the splendors and miseries of microservices on node.js (GeeksLab Odessa)
JS Lab2017, March 25, Odessa
Under the microscope: the splendors and miseries of microservices on node.js
Илья Климов (CEO at Javascript.Ninja)
"- What is this?
- A microservice!
- And what does it do?
- It micro-crashes."
Everyone is talking about microservices these days: how they save you from development complexity, cut deployment time, and improve overall system reliability. This talk is about the pitfalls awaiting those who ride this hype wave with Node.JS. We discuss mistakes that cost me and my company sleepless nights, lost revenue, and, at times, our faith in the power of microservice architecture.
All materials: http://jslab.in.ua/
Organizers: http://geekslab.org.ua/
Agenda
1. Big Data – what is it
2. Hadoop vs RDBMS – pros and cons
3. Hadoop & Enterprise architecture
4. Hadoop as ETL engine
5. Case Studies
Current state
Big data is an all-encompassing term for any collection of data sets so large and complex that they are difficult to process using traditional data processing applications.
Limitations & Problems
Big data is difficult to work with using most relational databases; it instead requires massively parallel software running on tens, hundreds, or even thousands of servers.
eBay.com uses two data warehouses at 7.5 petabytes.
Walmart handles more than 1 million customer transactions every hour.
Facebook handles 50 billion photos from its user base.
In 2012, the Obama administration announced the Big Data Research and Development Initiative.
CORE HADOOP - MapReduce
In 2004, Google published a paper on a process called MapReduce.
A distributed computing framework: process large jobs in parallel across many nodes and combine the results.
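As a rough illustration of the model (not part of the original deck), the sketch below expresses the classic word-count job as a Hadoop Streaming-style mapper and reducer in Python. The file name, input, and invocation are hypothetical; the map phase emits (word, 1) pairs, the framework sorts them by key, and the reduce phase sums each word's counts.

```python
#!/usr/bin/env python3
# wordcount_streaming.py - minimal word count in the MapReduce style.
# Map phase: emit one (word, 1) pair per word.
# Reduce phase: sum the counts for each word (input assumed sorted by key).
import sys
from itertools import groupby

def map_phase(lines):
    """Yield (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Yield (word, total) for key-sorted (word, count) pairs."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    if role == "map":
        for word, count in map_phase(sys.stdin):
            print(f"{word}\t{count}")
    else:  # reduce
        split_lines = (line.rstrip("\n").split("\t") for line in sys.stdin)
        typed_pairs = ((word, int(count)) for word, count in split_lines)
        for word, total in reduce_phase(typed_pairs):
            print(f"{word}\t{total}")
```

Locally this can be smoke-tested with `cat input.txt | python3 wordcount_streaming.py map | sort | python3 wordcount_streaming.py reduce`; on a cluster the same script would be supplied as the mapper and reducer of a Hadoop Streaming job, with HDFS providing the input and output paths.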
Hadoop Structure
HDFS is a distributed file system designed to run on commodity hardware.
HBase stores data rows in labelled tables (a sortable key and an arbitrary number of columns).
Hive provides data summarization, query, and analysis through an SQL-like interface.
Pig is a platform for analyzing large data sets, built around a high-level dataflow language.
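To make the Hive entry concrete, here is a minimal sketch of running an SQL-like query against HiveServer2 from Python. It assumes the PyHive client is installed (`pip install 'pyhive[hive]'`); the host, database, and the `page_views` table are placeholders, not values from the deck.

```python
# Query Hive's SQL-like interface from Python via HiveServer2.
from pyhive import hive

# Connection details are placeholders for a real cluster.
conn = hive.Connection(host="hive.example.com", port=10000, database="default")
cursor = conn.cursor()

# A typical summarization query: top URLs by view count for one day.
cursor.execute(
    """
    SELECT url, COUNT(*) AS views
    FROM page_views
    WHERE view_date = '2015-01-01'
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
    """
)
for url, views in cursor.fetchall():
    print(url, views)

cursor.close()
conn.close()
```

Behind the scenes Hive compiles such a query into MapReduce (or, in later versions, Tez/Spark) jobs over data stored in HDFS, which is what makes the SQL-like interface usable on Hadoop-scale tables.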
Hadoop vs RDBMS
RDBMS strengths:
- Performance for relational data
- Machine query optimization
- Mature workload management
- High-concurrency interactive query processing
Hadoop strengths:
- Schema-less model
- Human query optimization
- Ability to create complex dataflows with multiple inputs and outputs
- Parallelization of many analytic functions
How might this change in the future:
- Query optimization improvements in Hive (statistics, better join ordering, more join types, etc.)
- Startup time improvements (simpler query plans to pass out)
- Runtime performance improvements
Case Study 1
Hadoop as an ETL / Data Quality tool
PROBLEM
Traditional tools show poor performance in exception handling and data cleansing.
SOLUTION
Hadoop transforms the data into a single format and processes it using data cleansing workflows (a sketch follows below).
BENEFITS
Reduced TCO (commodity hardware usage)
Traceability of all data quality issues
Hadoop becomes the clean-data tool.
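The case study itself contains no code; purely as an illustrative sketch, one normalization-and-validation step of such a cleansing workflow might look like the Hadoop Streaming mapper below. The CSV layout, field names, and the OK/BAD tags are invented for the example.

```python
#!/usr/bin/env python3
# clean_mapper.py - illustrative cleansing step for a Hadoop Streaming job.
# Normalizes incoming CSV records to one canonical format and tags invalid
# rows so that every data quality issue stays traceable downstream.
import csv
import sys
from datetime import datetime

EXPECTED_FIELDS = 4  # hypothetical layout: customer_id, name, email, signup_date

def normalize(row):
    """Return a cleaned record, or raise ValueError describing the problem."""
    if len(row) != EXPECTED_FIELDS:
        raise ValueError("wrong field count")
    customer_id, name, email, signup_date = (field.strip() for field in row)
    if not customer_id.isdigit():
        raise ValueError("non-numeric customer_id")
    if "@" not in email:
        raise ValueError("malformed email")
    # Convert dates to one canonical format for all downstream consumers.
    date = datetime.strptime(signup_date, "%d/%m/%Y").strftime("%Y-%m-%d")
    return [customer_id, name.title(), email.lower(), date]

if __name__ == "__main__":
    writer = csv.writer(sys.stdout)
    for row in csv.reader(sys.stdin):
        try:
            writer.writerow(["OK"] + normalize(row))
        except ValueError as err:
            # Keep the raw record plus the reason, so the issue stays traceable.
            writer.writerow(["BAD", str(err)] + row)
```

A follow-up job (or reducer) could then route "BAD" records into an exceptions table while clean records flow on, which is where the traceability benefit in the slide comes from.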
Case Study 2
Know Your Customer PoC
BUSINESS CHALLENGE
• Knowing the actual customer reaction to products is essential for business growth, but it is difficult to get valuable insights. Social media is where customers really share their opinions.
SOLUTION
A Hadoop-based analysis tool that provides the ability to:
• Find events in the client streams and identify the required reaction
• Propose a product to a client based on his interests
Case Study 3
Enterprise ETL & Hadoop Integration
Goals:
MapReduce ETL job development without coding
Build, re-use, and check impact analysis with enhanced metadata capabilities
A Windows-based graphical development environment
Comprehensive built-in transformations
A library of Use Case Accelerators to fast-track Hadoop productivity
Summary
Big Data:
Cutting edge of DI technologies
State-of-the-art design approaches
A bit more than simple development: partly an art, the art of data management