The main idea of a Data Lake is to expose company data to people within the company in an agile and flexible way, while preserving the safeguarding and auditing features required for the company’s critical data. Most projects in this direction start out by depositing all of the data in Hadoop, inferring a schema on top of the data, and then using the data for analytics via Hive or Spark. This stack is a good approach for many use cases, as it provides cheap file-based storage and rich analytics on top. But many pitfalls and problems can show up along the way, most of which can be met by extending the toolset. The potential bottlenecks surface as soon as users arrive and start exploiting the Lake. For all of these reasons, planning and building a Data Lake within an organization requires a strategic approach, in order to build an architecture that can support it.
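To make the described pattern concrete, here is a minimal PySpark sketch of the schema-on-read workflow the abstract refers to: raw files are landed in Hadoop, the schema is inferred at read time, and analysts query the data with Spark SQL. The HDFS path and column names are illustrative assumptions, not details from the talk.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Schema is inferred from the raw files at read time ("schema on read").
events = spark.read.json("hdfs:///datalake/raw/events/")  # illustrative path
events.printSchema()

# Expose the raw data to analysts through plain SQL (Hive works similarly).
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT user_id, COUNT(*) AS actions
    FROM events
    GROUP BY user_id
    ORDER BY actions DESC
    LIMIT 10
""").show()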
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to complement them. In this presentation you will hear what Big Data and the Data Lake are, and which technologies are most popular in the Big Data world. We will also speak about Hadoop and Spark, how they integrate with traditional systems, and what their benefits are.
Speaker: Geetha Balasundaram, Developer at ThoughtWorks
From tools and technology to people and requirements, what's different in the data engineering space? App development is traditional by now. All enterprises want to become data-guided. A data lake is a good start, yet there is a great deal of know-how and do-how involved.
Drawing on experiences from building a data lake in the retail domain, the talk will cover:
- What this vast new space of data engineering is
- Why it is critical to think in terms of data rather than features
- How important it is to understand these technologies and create a data lake that is usable and insightful to the business
Richard Vermillion, CEO of After, Inc. and Fulcrum Analytics, Inc., discusses data lakes and their value in supporting the warranty and extended service plan chain.
A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account or file size. It stores data at high volume to increase analytic performance and enable native integration.
A Data Lake is like a large container, much as a real lake is fed by rivers. Just as a lake has multiple tributaries coming in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing in, often in real time.
Building the Enterprise Data Lake: A Look at Architecture - Mark Madsen
The topic is building an Enterprise Data Lake, discussing high-level data and technology architecture. We will describe the architecture of a data warehouse, explain how a data lake needs to differ, and show a high-level functional and data architecture for a data lake. This webinar will cover:
Why dumping data into Hadoop and letting users get it out doesn't work
The difference between a Hadoop application and a Data Lake
Why new ideas about data architecture are a key element
An Enterprise Data Lake reference architecture to frame what must be built
Big Data Architectures and the Data Lake - James Serra
With so many new technologies, it can get confusing which approach to building a big data architecture is best. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I'll discuss the four most common patterns in big data production implementations, the top-down vs. bottom-up approach to analytics, and how you can use a data lake and an RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and on how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
Data Lakes are early in the Gartner hype cycle, but companies are getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your big data lakes.
This white paper presents the opportunities opened up by the data lake and advanced analytics, as well as the challenges in integrating, mining, and analyzing the data collected from these sources. It goes over the important characteristics of the data lake architecture and the Data and Analytics as a Service (DAaaS) model. It also delves into the features of a successful data lake and its optimal design, and covers how data, applications, and analytics are strung together to speed up the insight-brewing process, with the data lake as a powerful architecture for mining and analyzing unstructured data.
Incorporating the Data Lake into Your Analytic Architecture - Caserta
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Traditional data storage and analytic tools no longer provide the agility and flexibility required to deliver relevant business insights. That’s why organizations are shifting to a data lake architecture. This approach allows you to store massive amounts of data in a central location so it's readily available to be categorized, processed, analyzed, and consumed by diverse organizational groups. In this session, we’ll assemble a data lake using services such as Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue.
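As a rough illustration of how two of those services fit together, here is a hedged boto3 sketch that lands a raw file in S3 and runs an Athena query over it. Bucket, database, and table names are made-up assumptions, and the Kinesis, EMR, and Glue pieces of the session are omitted.

import boto3

# Lake storage: one bucket for raw data and query results (us-east-1 assumed;
# other regions need a CreateBucketConfiguration argument).
s3 = boto3.client("s3")
s3.create_bucket(Bucket="my-data-lake-demo")
s3.upload_file("events.json", "my-data-lake-demo", "raw/events.json")

# Query in place with Athena; assumes a "raw_events" table was already
# defined over the S3 data (e.g. by a Glue crawler).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM raw_events",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-demo/athena-results/"},
)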
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra... - Data Con LA
Why and how the Big Data based Enterprise Data Lake solution, built on NoSQL and SQL technologies, has become significantly more effective at solving enterprise data challenges than its predecessor, the EDW, which had tried and failed to solve the same problem based entirely on SQL databases.
Data Lakes are meant to support many of the same analytics capabilities as Data Warehouses while overcoming some of their core problems. Yet Data Lakes have a distinctly different technology base. This webinar will provide an overview of the standard architecture components of Data Lakes.
This will include:
The Lab and the factory
The base environment for batch analytics
Critical governance components
Additional components necessary for real-time analytics and ingesting streaming data
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... - NoSQLmatters
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
Introduction to Microsoft’s Hadoop Solution (HDInsight) - James Serra
Did you know Microsoft provides a Hadoop Platform-as-a-Service (PaaS)? It’s called Azure HDInsight and it deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution that includes many Hadoop components such as HBase, Spark, Storm, Pig, Hive, and Mahout. Join me in this presentation as I talk about what Hadoop is, why deploy to the cloud, and Microsoft’s solution.
The Data Lake and Getting Businesses the Big Data Insights They Need - Dunn Solutions Group
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can be a task to keep up with and clearly understand each of them. However, a data lake is definitely something worth dedicating the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location, ready for consumption at any time - this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don’t hesitate to visit us online at: http://bit.ly/2fvV5rR
Big Data international keynote speaker Mark van Rijmenam shared his vision on Hadoop Data Lakes during a Zaloni webinar: what are the Hadoop Data Lake trends for 2016, what are the data lake challenges, and how can organizations benefit from data lakes?
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat... - DataWorks Summit
The Finance Data Lake's objective is to create a centralized enterprise data repository for all Finance and Supply Chain data. It serves as the single source of truth. It enables a self-service discovery analytics platform for business users to answer ad hoc business questions and derive critical insights. The data lake is based on the open source Hadoop big data platform and is a very cost effective solution for breaking the ERP data silos and simplifying the data architecture in the enterprise.
POCs were conducted on an in-house Hortonworks Hadoop data platform to validate cluster performance for production volumes. Based on business priorities, an initial roadmap was defined using 3 data sources: 2 SAP ERPs and PeopleSoft (OLTP systems). A development environment was established in the AWS Cloud for agile delivery. The near real time data ingestion architecture for the data lake was defined using replication tools and a custom Sqoop-based micro-batching framework, with data persisted in Apache Hive in ORC format. Data and user security are implemented using Apache Ranger, and sensitive data is stored at rest in encryption zones. Business data sets were developed as Hive scripts and scheduled using Oozie. Connectivity for multiple reporting tools, including SQL tools, Excel, and Tableau, was enabled for self-service analytics. Upon successful implementation of the initial phase, a full roadmap was established to extend the Finance data lake to over 25 data sources, scale up data ingestion, and enable OLAP tools on Hadoop.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... - Hortonworks
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels, including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and converts all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
Designing a Real Time Data Ingestion Pipeline - DataScience
In this presentation, Badar discusses DataScience Engineering’s process for designing a real time data ingestion pipeline, facing different technical challenges and changing business needs along the way. Badar will also discuss how big data technologies like Kafka/Kinesis, Spark/Hadoop, and Cassandra/DynamoDB help solve problems in high throughput data ingestion and processing.
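As a toy illustration of the consumption end of such a pipeline, the sketch below uses the kafka-python client to read JSON events from a topic and keep a running count per event type. The topic name, broker address, and message shape are assumptions, not details from the talk; Kinesis could play the same role.

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                                  # illustrative topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

counts = {}
for message in consumer:                       # blocks, consuming records as they arrive
    event = message.value
    key = event.get("type", "unknown")
    counts[key] = counts.get(key, 0) + 1       # trivial stand-in for real processing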
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa... - Zaloni
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma, thought leader and coauthor of Architecting Data Lakes, offers lessons learned from the field to get you started.
10 Amazing Things To Do With a Hadoop-Based Data Lake - VMware Tanzu
Greg Chase, Director, Product Marketing, presents Big Data: 10 Amazing Things to do With A Hadoop-based Data Lake at the Strata Conference + Hadoop World 2014 in NYC.
Building the Data Lake with Azure Data Factory and Data Lake Analytics - Khalid Salama
In essence, a data lake is a commodity distributed file system that acts as a repository to hold raw data file extracts of all the enterprise source systems, so that it can serve the data management and analytics needs of the business. A data lake system provides means to ingest data, perform scalable big data processing, and serve information, in addition to managing, monitoring, and securing the environment. In these slides, we discuss building data lakes using Azure Data Factory and Data Lake Analytics. We delve into the architecture of the data lake and explore its various components. We also describe the various data ingestion scenarios and considerations. We introduce the Azure Data Lake Store, then discuss how to build an Azure Data Factory pipeline to ingest data into the data lake. After that, we move into big data processing using Data Lake Analytics, and we delve into U-SQL.
Meaning making – separating signal from noise. How do we transform the customer's next input into an action that creates a positive customer experience? We make the data more intelligent, so that it is able to guide our actions. The Data Lake builds on Big Data strengths by automating many of the manual development tasks, providing several self-service features to end-users, and an intelligent management layer to organize it all. This results in lower cost to create solutions, "smart" analytics, and faster time to business value.
Implementing a Data Lake with Enterprise Grade Data Governance - Hortonworks
Hadoop provides a powerful platform for data science and analytics, where data engineers and data scientists can leverage myriad data from external and internal data sources to uncover new insight. Such power also presents a few new challenges: on the one hand, the business wants more and more self-service, and on the other hand, IT is trying to keep up with the demand for data while maintaining architecture and data governance standards.
In this webinar, Andrew Ahn, Data Governance Initiative Product Manager at Hortonworks, will address the gaps and offer best practices in providing end-to-end data governance in HDP. Andrew Ahn will be followed by Oliver Claude of Waterline Data, who will share a case study of how Waterline Data Inventory works with HDP in the Modern Data Architecture to automate the discovery of business and compliance metadata, data lineage, as well as data quality metrics.
India’s recent stand on Smart City Development, and the involvement of various high income countries, initiates the discussion of ideal variables for smart city evolution by our own standards. With a vision of urban governance for general livability, it becomes imperative to study these parameters and ensure the evolution of our own concept of a Smart City. Our spatial planning models, based on unique factors such as human diversity, physical-social networks, the impact of ICT on the urban fabric, city resilience, etc., make it all the more interesting to evolve a blueprint for planning a Smart City.
The paper centers on the infrastructural developments for Smart Urban Development in India. The research helps us arrive at a general line of action for urban planning implications catering to the infrastructure sector, amongst others, thus affecting the environmental, social, and economic structure significantly. The study further finds scope for progress, encouraged by various government policies, for successful implementation of Smart City Development. It also allows a peek into a future scenario of improvements and deliberations particular to Indian standards, in consideration of the scenario in other countries.
Enterprise Data Lake: How to Conquer the Data Deluge and Derive Insights that Matter
Data can be traced from various consumer sources. Managing data is one of the most serious challenges faced by organizations today. Organizations are adopting the data lake model because lakes provide raw data that users can use for data experimentation and advanced analytics. A data lake can be a merging point of new and historic data, drawing correlations across all data using advanced analytics. A data lake can support self-service data practices, tapping undiscovered business value from new as well as existing data sources. Furthermore, a data lake can aid in modernizing data warehousing, analytics, and data integration. However, lakes also face hindrances such as immature governance, user skill gaps, and security.
Is the Traditional Data Warehouse Dead? - James Serra
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
Vikram Andem - Big Data Strategy @ IATA Technology Roadmap - IT Strategy Group
Vikram Andem, Senior Manager, United Airlines, A case for Bigdata Program and Strategy @ IATA Technology Roadmap 2014, October 13th, 2014, Montréal, Canada
The term "Data Lake" has become almost as overused and undescriptive as "Big Data". Many believe that centralizing datasets in HDFS makes a data lake, but then they struggle to realize any tangible value. This talk will redefine the "Data Lake" by describing four specific, key characteristics that we at Koverse have learned are crucial to successful enterprise data lake deployments. These characteristics are 1) indexing and search across all data sets, 2) interactive access for all users in the enterprise, 3) multi-level access control, and 4) integration with data science tools. These characteristics define a system that lets people realize value from their data versus getting lost in the hype. The talk will go on to provide a technical description of how we have integrated several projects, namely Apache Accumulo, Hadoop, and Spark, to implement an enterprise data lake with these key features.
Data Lakehouse, Data Mesh, and Data Fabric (r2) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Build the data lake, but avoid building the data swamp! The tool ecosystem is building up around the data lake, and soon many organizations will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence in their data platforms up.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Data Lakehouse, Data Mesh, and Data Fabric (r1) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
In this document, we will present a very brief introduction to Big Data (what is Big Data?), Hadoop (how does Hadoop fit the picture?), and Cloudera Hadoop (what is the difference between Cloudera Hadoop and regular Hadoop?).
Please note that this document is for Hadoop beginners looking for a place to start.
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC) - Denodo
Watch full webinar here: https://bit.ly/3aePFcF
Historically, data lakes have been created as a centralized physical data storage platform for data scientists to analyze data. But lately, the explosion of big data, data privacy rules, and departmental restrictions, among many other things, have made the centralized data repository approach less feasible. In this webinar, we will discuss why decentralized multipurpose data lakes are the future of data analysis for a broad range of business users.
Attend this session to learn:
- The restrictions of physical single purpose data lakes
- How to build a logical multi purpose data lake for business users
- The newer use cases that make multi purpose data lakes a necessity
Minimizing the Complexities of Machine Learning with Data Virtualization - Denodo
Watch full webinar here: https://buff.ly/309CZ1Y
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
*How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
*How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
*How you can use the Denodo Platform with large data volumes in an efficient way
*About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
Building a Logical Data Fabric using Data Virtualization (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/3FF1ubd
In the recent Building the Unified Data Warehouse and Data Lake report by leading industry analyst firm TDWI, 64% of organizations stated that the objective of a unified Data Warehouse and Data Lake is to get more business value, and 84% of organizations polled felt that a unified approach to Data Warehouses and Data Lakes was either extremely or moderately important.
In this session, you will learn how your organization can apply a logical data fabric, and how the associated technologies of machine learning, artificial intelligence, and data virtualization can reduce time to value, increasing the overall business value of your data assets.
KEY TAKEAWAYS:
- How a Logical Data Fabric is the right approach to assist organizations to unify their data.
- The advanced features of a Logical Data Fabric that assist with the democratization of data, providing an agile and governed approach to business analytics and data science.
- How a Logical Data Fabric with Data Virtualization enhances your legacy data integration landscape to simplify data access and encourage self-service.
Filip Panjevic is a Co-Founder and CTO at ydrive.ai, a startup dealing with self-driving cars, and one of the founders of the Petnica Machine Learning School.
Filip's talk will focus on the story of the Petnica School: how it started, what has changed since the beginning, what the concept of the school looks like now, and why that concept is good for making new data scientists. This talk will be perfect for people who are considering starting their careers in the data science field!
The talk will be a broad overview and thoughts about building one of the biggest data science communities in India. I will talk about how an ecosystem is created and value delivered to each stakeholder. I will be sharing my experience of building MachineHack and AIMinds and other platforms. One of the core agendas of the talk will be how these platforms have enabled a unique data science education and learning experience in India. The platforms built help students and engineers to imagine and work towards a career in data science.
In Drazen's talk, you will get a chance to hear how the Data Science Master 4.0 program at Belgrade University was created, and what the benefits of the program are.
PwC's recently released Responsible AI Diagnostic surveyed around 250 senior business executives from May to June 2019. The survey says that 84% of CEOs agree that AI-based decisions need to be explainable in order to be trusted. In the past few years, deep learning has shown remarkable results in various applications, which makes it one of the first choices for many AI use cases. However, deep learning models are hard to explain, and since the majority of CEOs expect AI solutions to be explainable, deep learning has a serious challenge. Daniel Kahneman, in his book Thinking, Fast and Slow, presented two different systems the human brain uses to form thoughts and decisions: System 1 is fast, intuitive, and hard to explain; System 2 is slow, conscious, and easy to explain. In this talk I will present: A) the PwC Responsible AI Survey, B) a proposed deep learning framework that mimics the two systems of thinking, and C) the recent advances in the neural-symbolic learning field.
Challenges in building a churn prediction model in different industries, presented by Jelena Pekez from Comtrade System Integration. The talk is focused on real-life use-case experience.
In my talk I am going to share with the audience practical experience of using BI solutions for steering bank credit portfolios, making data actionable, and communicating and collaborating on that data with relevant stakeholders. In our case, we have aimed for a solution that can use data models based in the cloud and on-premises, easily communicate and share information within the organization, and keep track of that information flow. In addition, we want our solution to support various datasets and to have the flexibility of integrating the most popular DS languages, R and Python, for the convenience and flexibility of our data science team. Our solution is based on Power BI plus the use of Azure Analysis Services and R.
The talk will have 3 parts: an overview of the practical applications of AI and ML in the FinTech industry, with a short explanation of the PSD2 directive and the disruption it caused; the application of AI/ML from the perspective of the end-user (personal financial health, financial coach, etc.); and an overview of the architecture, technologies, and frameworks used, with practical examples from the Zuper company.
We present a recommender system for personalized financial advice, which we designed for a large Swiss private bank. The final recommendations produced by the system were delivered to the end clients through a mobile banking platform. The recommender system is based on a collaborative filtering technique and can work with changing asset features, operate with implicit ratings and react to explicit feedback that clients can give using the mobile app. Moreover, we developed and implemented an approach to provide an explanation for each recommendation in the form “As you bought A, you might like B".
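The bank's actual system is not public, but the core idea - item-based collaborative filtering over implicit ratings, with an "as you bought A, you might like B" explanation - can be sketched in a few lines of numpy. The holdings matrix below is made up for illustration.

import numpy as np

# rows = clients, columns = assets; 1 = client holds the asset (implicit rating)
R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)
assets = ["A", "B", "C", "D"]

# cosine similarity between asset columns
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

def recommend(client, top_n=1):
    owned = np.flatnonzero(R[client])
    scores = sim[owned].sum(axis=0)            # aggregate similarity to holdings
    scores[owned] = -np.inf                    # never re-recommend a holding
    for j in np.argsort(scores)[::-1][:top_n]:
        because = assets[owned[np.argmax(sim[owned, j])]]
        print(f"As you bought {because}, you might like {assets[j]}")

recommend(client=0)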
This talk will focus on making real-time pipelines using cutting edge Big Data technologies and applying ML to the gathered data. The first part of the presentation will cover the importance of and necessity for streaming data processing, and propose tools that can be used to build a streaming pipeline. The second part will focus on making machine learning models for customer support, introducing success stories that cover the need for more efficient customer support, problem resolution, and the benefits gained.
Presentation of the first complete AI investment platform. It is based on the most innovative AI methods: advanced neural networks (ResNet/DenseNet, LSTM, GAN autoencoders) and reinforcement learning for risk control and position sizing using the AlphaZero approach. It shows how a complex AI system covering both supervised and reinforcement learning can be successfully used for investment portfolio optimization in real time. The architecture of the platform and the algorithms used will be presented together with the machine learning workflow. A live demo of the platform will also be shown.
A lot of companies make the mistake of thinking that just hiring Data Scientists will lead to increased revenue or increased profit. For a company’s investment in Data Science to be successful the Data Scientists need to work on the right problems, with the right people, and with the right tools. In this presentation, I will talk about the lessons I have learned, and mistakes made in applying Data Science in commercial settings over the last 10 years. I will highlight what processes can increase the chances of Data Science investment being successful.
The talk will focus on the reasons and methods for creating models which maximize the sales price gross margin while retaining high confidence that the quote will be accepted by the client. Price changes are dynamic and impacted by many different elements, like the cost of input material, labor cost, transportation cost, and scrap material due to different ordered quantities. Besides input cost segments, the output price is also impacted by different marketing campaigns (our own and others'), seasonality, and past and future customer behavior, as well as the behavior of the product we are selling.
Andjela will share the best practices that Things Solver brings when it comes to data monetization. Things Solver clients sell more customized offerings and end up with happier customers. Andjela will share the machine learning modules that do just that within Coeus, the Things Solver platform.
In the past few years, many businesses have started to understand the potential of real-time data analytics. And many of those invested time, energy, and finances to make it happen, with weaker outcomes than expected. There are a few reasons for this: too ambitious plans by leadership regarding leveraging data, not enough discipline in defining goals and an MVP for initial use cases, a plethora of tools and vendors who claim they can solve all the problems, etc. So, how can we get the most value, at reasonable cost, out of fast (real-time) data? We will try to answer this question and give actionable advice.
University of Nottingham Ningbo China. The advances of 5G, sensor, and information technologies have enabled the proliferation of smart pervasive sensor networks. Rapid progress in the design of biomedical sensors, advances in the management of medical knowledge, and improvements in algorithms for decision support are fueling a technological disruption of health monitoring. Current technologies enable personalized A3 (anyplace, anytime, anywhere) health monitoring. Continuous health monitoring extends health care into the home and workplace, changing the modes of traditional health care delivery. Medical grade systems require innovative solutions for system dependability, medical decision support, data management, and interpretation, beyond current fitness and wellbeing applications. We will present innovative solutions for A3 health monitoring and discuss the use of blockchain technologies and artificial intelligence, addressing technical, medical, and ethical requirements for personalized health monitoring systems.
Data Quality is essential for e-commerce and automating it can reduce a business’s daily bottlenecks and promote its competitiveness. Product similarity can help reduce duplicate content leveraging all types of product information. But dealing with mixed-type data such as product data is a rather untypical but real business case and can be challenging.
Uroš Valant has almost 20 years of experience in planning, managing, and delivering various IT projects. His best and richest experience is in the field of business analytics, project planning and implementation, database design, and the management of development teams. In recent years, his focus has been the field of predictive analytics, machine learning, and applying AI solutions to practical use in different fields of work.
In his talk he will present an interactive case study of the use of image recognition and AI-assisted design techniques in the textile industry.
The presentation will start as an engaging lecture where I will present the motivation behind the project, based on my academic research (my Oxford PhD among others). I will tell the audience just how rampant corruption is in local governance and why it is so persistent. Then I will present our remedy: full budget transparency. I will show them our search engine and how it works, and will call on the participants to download the APIs and play with the data themselves.
The talk will be divided into two parts. The first one is about geospatial open data and several Copernicus services where those data can be downloaded. The second one is about Forest and Climate project, as an example of geospatial analysis. The aim of the project was to identify the most suitable area for afforestation in Serbia by using satellite and Earth observation data. The results can be found at https://sumeiklima.org/.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
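As a small illustration of the first technique above - skipping computation on vertices that have already converged - here is a plain-Python sketch (not the STICD implementation). It assumes every vertex appears as a key in out_links and that the graph has no dead ends.

def pagerank_skip_converged(out_links, d=0.85, tol=1e-10, max_iter=100):
    N = len(out_links)
    # in_links[v] lists (u, out_degree(u)) for every edge u -> v
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append((u, len(outs)))
    rank = {v: 1.0 / N for v in out_links}
    active = set(out_links)                    # vertices still being updated
    for _ in range(max_iter):
        new_active = set()
        for v in active:
            r = (1 - d) / N + d * sum(rank[u] / deg for u, deg in in_links[v])
            if abs(r - rank[v]) > tol:
                # v changed, so its out-neighbours must be recomputed next round
                new_active.update(out_links[v])
            rank[v] = r
        if not new_active:                     # everything has converged
            break
        active = new_active
    return rank

# sanity check: a 3-cycle converges to equal ranks of 1/3
print(pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]}))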
Adjusting Primitives for Graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
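A minimal non-distributed sketch of the levelwise idea, using networkx for the strongly connected component condensation, might look like the following. It assumes the no-dead-end precondition stated above holds; upstream ranks are final by the time a component is iterated, so no per-iteration global pass is needed.

import networkx as nx

def levelwise_pagerank(G, d=0.85, tol=1e-10, max_iter=100):
    N = G.number_of_nodes()
    rank = {v: 1.0 / N for v in G}
    C = nx.condensation(G)                      # DAG of strongly connected components
    for scc_id in nx.topological_sort(C):       # process components level by level
        comp = C.nodes[scc_id]["members"]
        # Contributions from outside the component are already final.
        external = {
            v: sum(rank[u] / G.out_degree(u)
                   for u in G.predecessors(v) if u not in comp)
            for v in comp
        }
        for _ in range(max_iter):               # iterate only within the component
            prev = {v: rank[v] for v in comp}
            for v in comp:
                internal = sum(prev[u] / G.out_degree(u)
                               for u in G.predecessors(v) if u in comp)
                rank[v] = (1.0 - d) / N + d * (external[v] + internal)
            if sum(abs(rank[v] - prev[v]) for v in comp) < tol:
                break
    return rank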
Tabula.io Cheatsheet: automate your data workflows
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
1. Milos Milovanovic, Co-Founder & Data Engineer @ Things Solver
milos@thingsolver.com
milos@datascience.rs
Planning and Optimizing
Data Lake Architecture
2. Agenda
Introduction - Business Data Requirements
What is A Data Lake?
A Common Data Lake Architecture
When Problems Start To Show Up - Optimizing Data Lake
Expanding a Data Lake
How To Plan Data Lake - Success Factors
3. Introduction - Business Data Requirements
The main goal for organizations is to adapt and put all of their data to use.
It’s not an easy task - it might require mindset and structural changes.
Flexibility and agility are required for success.
Various trends and buzzwords are making it hard to stay on track.
Challenge of Transforming Enterprise Data Management - (“The data lake is a foundational component and common denominator of the modern data architecture, enabling and complementing specialized components, such as enterprise data warehouses, discovery-oriented environments, and highly-specialized analytic or operational data technologies…” - John O’Brien, CEO @ Radiant Advisors)
4. Data Lake - The Very First Definition
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
- James Dixon, CTO @ Pentaho
5. A More Formal Definition
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
7. Data Warehouse & Data Lake by Example
Social Media Streaming can be implemented using a traditional Data Warehouse
… but such an application will be too restricted and inflexible (e.g. when extending the number of columns analyzed).
Using a Data Lake for this purpose gives us the flexibility to adapt and test new metrics
… and we can easily add new applications on top.
9. A Common Data Lake Implementation Architecture
❏ In general, the architecture of a data lake is simple: a Hadoop File System (HDFS) with lots of directories and files on it.
❏ Hadoop is usually at the center of a Data Lake Architecture, although the concept is broader than Hadoop.
❏ Hadoop’s scalable, low-cost persistence layer and its ability to perform big data processing and analytics are a great toolset for achieving measurable business value at speed and low cost.
❏ Hive and Spark provide rich analytics on top of the data that is persisted at low cost.
10. This Architecture:
Acts like SQL
Efficient and Scalable
Connects to Basically Anything
Different Processing Modes (Realtime, Batch, Pipelines, Machine Learning, Ad Hoc Analysis …)
[Diagram: data sources feeding the Hadoop Distributed File System, with Hive and Spark on top]
11. When Problems Show Up
Hadoop + Spark/Hive != Database
- Searching for a row within TBs of data:
select * from my_table where some_column like '%123asd%';
- No updates and deletes
- Too many concurrent requests from BI tools
...
Spark Best Practice: http://go.databricks.com/not-your-fathers-database
12. How Do We Optimize Such a Solution?
❏ Use ORC File Format
❏ File Compaction (small files, deduplication)
❏ Run Spark on YARN
❏ Use Spark Dataframes
❏ Data Caching
❏ Use Traditional Databases
❏ Extend the Toolset (Solr, ES, Kafka, Redis, …)
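A hedged PySpark sketch of a few of the optimizations in the list above follows; paths, table, and column names are illustrative assumptions rather than details from the talk.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-optimizations")
         .enableHiveSupport()                   # lets DataFrames read/write Hive tables
         .getOrCreate())

raw = spark.read.json("hdfs:///lake/raw/events/")        # illustrative path

# Compact many small files into fewer, larger ones and deduplicate rows.
compacted = raw.dropDuplicates().coalesce(16)

# Persist in the ORC columnar format for fast scans from Hive and Spark.
compacted.write.mode("overwrite").format("orc").saveAsTable("lake.events")

# Cache a hot DataFrame so repeated ad hoc queries skip the rescan.
events = spark.table("lake.events").cache()
events.filter(events.event_type == "purchase").groupBy("country").count().show()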
13. Data Lake - Extended Toolset
[Diagram: HDFS surrounded by the extended toolset, and many more...]
14. [image-only slide]
15. How To Start With The Data Lake?
❏ Think of the Use Cases (don’t plan all the use cases - have some in mind)
❏ Master the Technology
❏ Go agile and flexible
❏ Do not forget about Data Governance, Data Quality, and Security (but do not drown in this)
❏ Integrate with BI and DWH
17. Milos Milovanovic, Co-Founder & Data Engineer @ Things Solver
milos@thingsolver.com
milos@datascience.rs
Planning and Optimizing
Data Lake Architecture