Introducing the Big Data Ecosystem with Caserta Concepts & Talend

In this one-hour webinar, Caserta Concepts and Talend described an architectural framework and roadmap for extending a traditional enterprise data warehouse environment into a Big Data ecosystem.

They illustrated the architectural components involved for collecting, analyzing and delivering Big Data, with a focus on the importance of Hadoop, Data Integration, Machine Learning, NoSQL, Business Intelligence and Analytics.

Attendees learned:
Which Big Data technologies can’t be ignored
Considerations when extending the data ecosystem
What happens to your existing investment
What are the points of integration
Does Big Data = better data?

To access the recorded webinar or to learn more, visit http://www.casertaconcepts.com/.

Published in: Technology
  • Purpose of the slide: Mission / Vision Statement
    Key themes:
    Talend’s mission is to enable our customers to innovate faster at a lower cost.
    We are disrupting the traditional integration market by delivering an open source-based solution, an innovative unified platform, and a usage-based subscription model.
    More from the Talend boilerplate:
    Talend provides integration that truly scales. From small projects to enterprise-wide implementations, Talend’s highly scalable data, application and business process integration platform maximizes the value of an organization’s information assets and optimizes return on investment through a usage-based subscription model. Ready for big data environments, Talend’s flexible architecture easily adapts to future IT platforms. And a common set of easy-to-use tools implemented across all Talend products enables teams to scale developer skillsets, too.
  • Purpose of the slide: Introduce Talend’s solution – Integration At Any Scale
    Talking points:
    Talend is disrupting the integration market to address these integration challenges by providing a differentiated solution that provides “Integration at Any Scale”.
    With Talend, your business can scale to meet any integration challenge, any data volume, or any project size.
    We will discuss HOW this is done in a moment, but the main point here is what we call “Integration Convergence”.
    Integration Convergence is the ability to address data, application and process integration needs with the same platform.
    The benefit to you is that your resources are more efficient and you lower your cost of operations.
    Talend provides integration that truly scales. From small projects to enterprise-wide implementations, Talend’s highly scalable data, application and business process integration platform maximizes the value of an organization’s information assets and optimizes return on investment through a usage-based subscription model. Ready for big data environments, Talend’s flexible architecture easily adapts to future IT platforms.
  • Endeca bought by Oracle – “agile information management”
    SPSS bought by IBM
    Radian6 bought by Salesforce
    DataStax – Cassandra
    Karmasphere – data analysis platform for Hadoop
    Couchbase – NoSQL – Membase and Couchbase
    Clarabridge – text analytics
  • Alternative NoSQL: HBase, Cassandra, Druid, VoltDB

    1. The Big Data Ecosystem
       Talend & Caserta Concepts Webinar
       Ciaran Dynes, Director, Product Management & Product Marketing, Talend
       Joe Caserta, Founder & President, Caserta Concepts
    2. Integration at Any Scale
       Talend is the only integration vendor that enables your business to scale through:
       • An open source-based solution supported by a vast community and enterprise-class services
       • An innovative, unified platform that scales data, application and business processes of any complexity
       • A usage-based subscription model delivering a fast return on investment
    3. Talend - Integration at Any Scale
       Talend offers true scalability for
       • Any integration challenge
       • Any data volume
       • Any project size
       Talend enables integration convergence
    4. Working with Leading Vendors
       [Vendor logo grid; categories: Platforms/Hadoop, Appliance, NoSQL, Data Management, Analytics, System Integrators]
       System Integrators play a vital role in providing expertise
    5. The Big Data Ecosystem
       Talend & Caserta Concepts Webinar
       Joe Caserta, Founder & President, Caserta Concepts
       Ciaran Dynes, Director, Product Management & Product Marketing, Talend
    6. Joe Caserta Timeline
       • 2012: Partnered with Big Data vendors Cloudera, HortonWorks, Datameer, more…; laser focus on Big Data solutions for Financial Sector & eCommerce
       • 2010: Formalized Talend Alliance Partnership – System Integrators
       • 2009: Launched Big Data practice
       • 2004: Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley); launched Training practice, teaching data concepts world-wide
       • 2001: Founded Caserta Concepts in NYC; web log analytics solution published in Intelligent Enterprise
       • 1996: Began consulting career as programmer/data modeler; dedicated to Data Warehousing, Business Intelligence since 1996
       • 1986: 25+ years hands-on experience building database solutions
    7. Caserta Concepts
       • Technology services company with expertise in data analysis:
         • Data Management
         • Big Data & Analytics
       • With core focus in the following industries:
         • Financial Services
         • Insurance / Healthcare
         • eCommerce / Higher Education
       • Established in 2001:
         • Increased growth year-over-year
         • Industry recognized work force
         • Consulting, Writing, Education
    8. Expertise & Offerings
       • Strategic Roadmap/Assessment/Consulting
       • Big Data Analytics
       • Data Warehousing/ETL/Data Integration
       • BI/Visualization/Analytics
       • Master Data Management
    9. Client Portfolio
       • Finance & Insurance
       • Retail/eCommerce & Manufacturing
       • Education & Services
    10. The Good Old Days: Traditional Data Warehousing
        [Diagram labels: Web Logs, External Data Sources, Relational Systems/ERP, Legacy Systems; Extract, Transform, Load; Optimized Data Warehouse; Data Marts (The data warehouse?); Metadata; Standard Reports, Ad-hoc Query Tools, Data Mining, MDD/OLAP, Analytical Applications; Closed-loop feedback applications]
    11. What is “Big Data”?
        • A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
        • Challenges include capture, storage, search, sharing, transfer, analysis, and visualization.
        • Relational databases were designed for applications; we use only a small fraction of their capabilities in analytics applications.
        • Enforcing a relational structure upon our data is not always what we want.
    12. What’s the Difference?
        Traditional Data | Big Data
        Very accurate transactional data. Analyzed by humans | Lots of data with value that can only be attained by deep analytics
        Measured in terabytes | Measured in petabytes
        Structured data | Structured/Unstructured data
        Input by human “system users” | Created by everybody, plus all of our machine friends
        Oracle, SAP, etc. | Open source, Hadoop
        HW/SW investment measured in $10M | HW/SW investment measured in $10K
        Recording facts | Harvesting insights
    13. Try to keep up: This slide is already obsolete
    14. So where does the data warehouse come in?
        • Will Big Data replace the data warehouse?
          • Yes – however there is much evolution ahead: real time integrations, interactive queries
        • Data Warehousing principles still apply to Big Data
          • Data Quality
          • Master Data
          • Data architecture
        • How do we leverage our existing investment?
    15. Enterprise Technical Ecosystem
        [Diagram labels: ERP, Finance, Legacy; ETL; Traditional EDW; Traditional BI (Ad-Hoc/Canned Reporting); Big Data Cluster: NoSQL Database (Cassandra), Mahout, MapReduce, Pig/Hive, nodes N1-N5 on the Hadoop Distributed File System (HDFS), Horizontally Scalable Environment - Optimized for Analytics; Big Data BI (Search/Data Analytics, Canned Reporting)]
    16. Extending EDW with Hadoop
        • Eliminate barrier of imposing relational structure on data.
        • Storage is fast, durable and cheap: Don’t throw away data that can be valuable in the future
        • Processing power
          • Hadoop scales linearly; don’t worry about the data set getting too big
        • Machine learning
        • Ad-Hoc reporting by non-technical users requires traditional methods or additional application
    17. Design Pattern #1: Hadoop Staging/Warehouse feed relational EDW (Composite Warehouse)
        • Hadoop serves as the staging ground for all data
          - Eliminate barrier of imposing relational structure on data.
          - Storage is fast, durable and cheap: Don’t throw away data that can be valuable in the future
        • Data scientists will work in the Hadoop environment to analyze and mine structured and unstructured data using Pig, Hive, and Mahout (machine learning)
        • Data required for interactive reporting and traditional ad-hoc analysis is sent to downstream relational EDW
        [Diagram labels: Source Systems; Mahout, MapReduce, Pig/Hive; nodes N1-N5 on the Hadoop Distributed File System (HDFS); Traditional DW]
    18. Design Pattern #2: NoSQL Enhanced EDW
        • Not all structured data lends itself to being stored relationally:
          • Relationships: Graph Databases
          • Sparse Data: Columnar Databases
        • Very Large Datasets:
          • NoSQL databases are capable of scaling far beyond relational databases while maintaining performance
          • Ultra-performance key value stores and columnar databases can be very useful in storing certain types of high volume data for analytic purposes
          • Just don’t expect the ad-hoc flexibility of a relational database!
        [Diagram labels: Cassandra (columnar): Web analytics, Ad Impressions; Titan (graph): Networks, Recommender, Path optimization; Mahout, MapReduce, Pig/Hive; nodes N1-N5 on the Hadoop Distributed File System (HDFS); Traditional DW]
    19. Design Pattern #3: Add analytics to your NoSQL cluster
        • If your application is already based on a NoSQL technology, consider building an analytic site.
        • The analytic site is constantly streamed fresh transactions, leveraging Cassandra’s native replication
        • Aggregates and analytic views are materialized with Pig/Hive map/reduce. Since the work is done on the analytic site, no load is placed on the applications. This analytic data is in turn replicated throughout the cluster
        • Remember, NoSQL schemas are “optimized to a query”, not ad-hoc
        [Diagram labels: Site 1 (Cassandra); Site 2 (Cassandra); Analytics Site (Cassandra, Pig/Hive, MapReduce); Canned Reporting; Traditional DW]
    20. Emerging Tools
        Hive, although an excellent tool for data analysis, is too slow for interactive queries. Recent projects have increased speed dramatically (10-100x):
        • Google Dremel
        • Apache/MapR Drill
        • Hortonworks Stinger
        • Cloudera Impala
    21. Commonly Used Technologies
        • Amazon Elastic MapReduce (EMR): Web service to access EC2/S3, pay-as-you-go hosted Hadoop Infrastructure
        • Hadoop Distribution: Cloudera; MapR; Hortonworks
        • Apache Projects
          • Whirr: Used to launch/kill computing clusters
          • Kafka: Publish-subscribe messaging system
          • Mahout: Distributed machine learning
          • Hive: Map data to structures and use SQL-like queries
          • HBase: NoSQL/non-relational database, real-time read/write
          • Cassandra: Like HBase, no single point of failure
          • Chukwa/Flume: Large-scale log collection
          • Pig: Procedural programming language, from Yahoo
          • Sqoop: “SQL-to-Hadoop”, like BCP for Hadoop
          • Zookeeper: Used to manage & administer Hadoop
          • Solr: Full-text/Faceted Search
          • MongoDB: Document-oriented database
        • Languages: Python, SciPy, Java
    22. Leading Vendors (According to Joe)
        [Vendor logo grid; categories: Hadoop, NoSQL, Analytics, Data Management]
    23. Parting Thought
        Polyglot Persistence – “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we’ll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” -- Martin Fowler
    24. Questions?
        Please ask your questions now using the Q&A panel
    25. Resources
        ➜ Recording will be made available on www.talend.com/resources/webinars
        ➜ Request a copy of the slides: webinar@talend.com
        ➜ Contact Talend Sales
          • Email: sales@talend.com
          • Phone: 714.786.8140
        ➜ Contact Caserta Concepts
          • Joe Caserta, President
          • Email: joe@casertaconcepts.com
          • Phone: 855.755.2246 x227
        © Talend 2012
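
Design Pattern #1 (slide 17) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the presenters' implementation: plain Python lists stand in for files staged on HDFS, a `Counter` stands in for a Pig/Hive batch job, and an in-memory `sqlite3` database stands in for the downstream relational EDW. The event fields, table name and function names are all hypothetical.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical raw events as they might land in staging: no fixed schema,
# some records carry fields that others lack.
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/home"}',
    '{"user": "a", "page": "/cart", "referrer": "ad-42"}',
]


def stage(lines):
    """Keep every record as-is; structure is applied only when reading."""
    return [json.loads(line) for line in lines]


def aggregate_page_views(records):
    """Batch step (stand-in for a Pig/Hive job): count views per page."""
    return Counter(record["page"] for record in records)


def load_into_edw(conn, page_views):
    """Send only the reporting-ready aggregate to the relational store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_views (page TEXT PRIMARY KEY, views INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO page_views VALUES (?, ?)",
        sorted(page_views.items()),
    )
    conn.commit()


def run_pipeline():
    """Stage raw data, aggregate it, load the result, then query it with SQL."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite plays the EDW role
    load_into_edw(conn, aggregate_page_views(stage(RAW_EVENTS)))
    return conn.execute(
        "SELECT page, views FROM page_views ORDER BY page"
    ).fetchall()
```

The raw records stay in staging untouched, so a later analysis can still use fields such as `referrer` that the first aggregate ignored.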
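
The two non-relational shapes named in Design Pattern #2 (slide 18), sparse data and relationships, might look like this in miniature. The sketch uses in-memory Python structures only; `WideRowStore` and `shortest_path_length` are hypothetical names and do not reflect the Cassandra or Titan APIs.

```python
from collections import defaultdict, deque


class WideRowStore:
    """Toy column-family store: each row holds only the columns it actually has."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get_row(self, row_key):
        # Return a copy; a missing row is simply empty, with no NULL padding.
        return dict(self._rows.get(row_key, {}))


def shortest_path_length(edges, start, goal):
    """Breadth-first search over an undirected graph; returns hops or None."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

A relational table would need a column for every possible attribute or a join per hop; here a row stores only what it has, and a path query is a single traversal.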
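
The “optimized to a query” remark in Design Pattern #3 (slide 19) can be illustrated with a small sketch: an aggregate is materialized once, keyed exactly the way the report will ask for it, so reads become plain lookups. The transaction fields and function names are hypothetical, and a Python dict stands in for the replicated analytic view.

```python
from collections import defaultdict


def materialize_daily_revenue(transactions):
    """Precompute the view the report needs: revenue keyed by (day, product)."""
    view = defaultdict(int)
    for txn in transactions:
        view[(txn["day"], txn["product"])] += txn["amount_cents"]
    return dict(view)


def revenue_for(view, day, product):
    """Canned-report read: a single key lookup, no scan of raw transactions."""
    return view.get((day, product), 0)


# Hypothetical transactions streamed from the application sites.
TRANSACTIONS = [
    {"day": "2012-11-01", "product": "book", "amount_cents": 1000},
    {"day": "2012-11-01", "product": "book", "amount_cents": 500},
    {"day": "2012-11-02", "product": "pen", "amount_cents": 200},
]
```

The trade-off is the one the slide warns about: a question the key was not designed for, such as revenue per customer, requires building another view instead of writing an ad-hoc query.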
