The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

  • Have someone introduce me.
    Thank audience (tie to morning activities), sponsors, HP, etc.

    We’re here because this is the biggest thing that has happened to Hadoop…
  • Here at the conference we’re talking about data science. But before we can appreciate the changes happening in data science, we must first talk about data. Data is doubling every two years. The fast-growing volume, variety and velocity of data are overwhelming traditional systems and approaches. A revolutionary approach is required to leverage this data. And with this new technology, data science as we know it is undergoing tremendous change.
  • To give you a sense of the data volumes we’re talking about, I’ve included this chart, which shows why a revolutionary approach is needed. You can see the amount of data growing from 1.8 zettabytes in 2011 to a projected 44 zettabytes in 2020. To put this into perspective, a large data warehouse contains terabytes of data. A zettabyte is 1 billion terabytes.


    Numbers in the chart are from two IDC reports (sponsored by EMC).
    http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf
    http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
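
    As a rough consistency check on these figures, the sketch below computes the growth factor and the doubling time implied by the IDC estimates shown on slide 3 (1.8 ZB in 2011, 4.4 ZB in 2013, 44 ZB in 2020). The function name is illustrative, and smooth exponential growth is an assumption made here, not something IDC states.

```python
import math

# 1 ZB = 10^21 bytes and 1 TB = 10^12 bytes, so 1 ZB = 10^9 TB.
TERABYTES_PER_ZETTABYTE = 10**21 // 10**12


def doubling_time_years(start_zb, end_zb, years):
    """Doubling time implied by growing from start_zb to end_zb over `years`,
    assuming smooth exponential growth (an assumption, not an IDC claim)."""
    return years * math.log(2) / math.log(end_zb / start_zb)


growth_factor = 44 / 1.8  # 2011 -> 2020
doubling_2013_2020 = doubling_time_years(4.4, 44, 2020 - 2013)

print(f"1 ZB = {TERABYTES_PER_ZETTABYTE:,} TB")
print(f"Growth 2011-2020: about {growth_factor:.1f}x")
print(f"Implied doubling time 2013-2020: about {doubling_2013_2020:.1f} years")
```

    Under that assumption the 2013 and 2020 estimates imply a doubling time of roughly 2.1 years, which is in line with the "doubling every two years" headline.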
  • What is the source of this data growth? While structured data growth has been relatively modest, the growth in unstructured data has been exponential.


    Source of statistic: http://link.springer.com/chapter/10.1007/978-3-642-39146-0_2
  • sensor data, social media, clickstream, genomic data, location information, video files, etc.
  • The system that is enabling this growth in data capture is Hadoop.
  • We are proud/fortunate that Forrester has named MapR as the best Hadoop distribution in the market.
  • Many organizations now want to unlock the data in Hadoop and make it accessible to a broader audience within their organizations. That’s easier said than done. While we’ve largely solved the infrastructure scalability challenge, the massive volume, variety and velocity of this data introduce serious challenges on the human side, such as how to prepare all that data and make it available to users, and how to make operational data available in real time for analytics. We need better technology to empower users to take advantage of these massive volumes of data.

    Past: Enable organizations to capture the data.
    Future: Enable organizations to more easily extract value from all this captured data.

    What does the future of Hadoop look like?

    The problem
    I’m sure many of you have experienced this (just like the quotes)
    Why we want to solve it
    Here’s what we’re doing about it
  • One of the challenges with Hadoop as well as traditional data management tools is the business user’s “distance from the data”.
    The dependency on IT (or additional development) increases time to value and reduces agility. It also creates a burden on IT at a time when IT is already overworked. The red arrows in this illustration can represent significant backlogs and delays (often many months).

    Many of you are likely having to spend a lot of time on plumbing development and data preparation. How many have had to do this? (show hand)
  • “Data modeling and transformations” may seem easy, but when you look at a real-world environment, you could have thousands of data sets.
  • Opportunity
  • This is the opportunity.
    The audience should feel like this is their chance to become heroes by bringing this to their companies.
    They have to feel (be emotional) about the problem at this point.
  • IT-driven = months of delay, unnecessary work (data is no longer relevant, etc.)
    The so-what needs to be conveyed. Why does it matter that it’s not needed?

    6 months -> 3 months -> 3 months -> day zero

    So imagine now what you can get…

    Data Agility is needed for Business Agility


    >>> Stand still during slide, move in at the punchline (why does this matter to YOU)
  • Need an example or analogy to explain self-describing data.
  • All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON (with additional types) documents. Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before.

    If you consider the four data models shown in the 2x2, all models can be represented by the complex, no schema model (JSON) because it is the most flexible. However, no other data model can be represented by the flat, fixed schema model. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be available to queries.
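
    To make the contrast concrete, here is a minimal Python sketch (not Drill code) of the kind of transformation a flat, fixed-schema engine would need before it could query the two sample records shown on slides 18 and 19. The `flatten` helper, the `FIXED_COLUMNS` list and the choice to join repeated values with `;` are all illustrative assumptions.

```python
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

# A fixed schema has to be declared up front (hypothetical column list).
FIXED_COLUMNS = ["name.first", "name.last", "hobbies", "district", "preschool"]


def flatten(record, prefix=""):
    """Turn nested maps into dotted column names and lists into one string."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            flat[column] = ";".join(str(item) for item in value)
        else:
            flat[column] = value
    return flat


# Every row gets every column; fields a record lacks become None (NULL).
rows = [[flatten(r).get(col) for col in FIXED_COLUMNS] for r in records]

for row in rows:
    print(row)
```

    Note what the flat form loses: the repeated `hobbies` field is squeezed into a single string, and each record carries empty columns for fields that only the other record has.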
  • TODO: Add Impala and Splunk logos
  • What I want you to see now is how easy it is to ….
  • Is there something from Israel?
  • With other technologies you have to do this, then this, then this, …
  • Key takeaways
    Core message – We are revolutionizing Hadoop
    Call to action – get involved, and enjoy the conference as we have great speakers


    If doing Q&A, set boundaries (time - how much time we have, topic – what questions can I answer about this revolution), back pocket question (someone asked me this morning)

Transcript

  • 1. © 2014 MapR Technologies The Future of Hadoop: Data Agility
  • 2. Data is doubling in size every two years
  • 3. IDC estimates that in 2020, there will be 44 zettabytes of data in the world. Chart: 1.8 zettabytes (2011), 4.4 zettabytes (2013), 44 zettabytes (2020). Source: IDC Digital Universe
  • 4. Unstructured data will account for more than 80% of the data collected by organizations. Chart: total data stored, structured vs. unstructured data, 1980 to 2020. Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
  • 5. Unstructured Data is Ubiquitous: Social Media Messages Audio Sensors Mobile Data Email Clickstream
  • 6. Hadoop Adoption is Exploding. Chart: job trends from Indeed.com, Jan ‘06 to Jan ‘14
  • 7. The MapR Distribution for Hadoop: Best Product; Exponential Growth; 3X bookings Q1 ‘13 – Q1 ‘14; 80% of accounts expand 3X; 90% software licenses; <1% lifetime churn; >$1B in incremental revenue generated by 1 customer; 500+ Customers; Big Data Riding the Wave with Hadoop; The Big Data Platform of Choice
  • 8. 360° Customer View: 5PB CUSTOMER DATA
  • 9. Largest Biometric Database in the World: 1.2B PEOPLE
  • 10. The Future of Hadoop: Data Agility
  • 11. Distance to Data. Existing approaches require a middleman (IT). Diagram: with MapReduce, “plumbing” development sits between the data and the business (analysts, developers); with Hive and other SQL-on-Hadoop, modeling and transformations sit between the data and the business (analysts, developers)
  • 12. Real-World Data Modeling and Transformations
  • 13.
  • 14. Distance to Data. Existing approaches require a middleman (IT). Diagram: MapReduce, with “plumbing” development between the data and the business (analysts, developers); Hive and other SQL-on-Hadoop, with modeling and transformations between the data and the business (analysts, developers); and Data Agility, where only the data and the business (analysts, developers) are shown
  • 15. Why Improve Distance to Data?
    1. Improve time to value
       • Enable rapid data exploration and application development
       • IT should provide a valuable service without “getting in the way”
    2. Reduce the burden on IT
       • Can’t add DBAs to keep up with the exponential data growth
       • Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
  • 16. APACHE DRILL
    • Pioneering Data Agility for Hadoop
    • Apache open source project
    • Scale-out execution engine for low-latency queries
    • Unified SQL-based API for analytics & operational applications
    40+ contributors; 150+ years of experience building databases and distributed systems
  • 17. Evolution Towards Self-Service Data Exploration
    Traditional BI w/ RDBMS: Data Modeling and Transformation IT-driven; Data Visualization IT-driven
    Self-Service BI w/ RDBMS: Data Modeling and Transformation IT-driven; Data Visualization Self-service
    SQL-on-Hadoop: Data Modeling and Transformation IT-driven; Data Visualization Self-service
    Self-Service Data Exploration (Zero-day analytics): Data Modeling and Transformation Not needed; Data Visualization Self-service
  • 18. (1) Self-Describing Data is Ubiquitous
    Flat files in DFS
    • Complex data (Thrift, Avro, protobuf)
    • Columnar data (Parquet, ORC)
    • Loosely defined (JSON)
    • Traditional files (CSV, TSV)
    Data stored in NoSQL stores
    • Relational-like (rows, columns)
    • Sparse data (NoSQL maps)
    • Embedded blobs (JSON)
    • Document stores (nested objects)
    Example records:
    { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
    { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 19. (2) Drill’s Data Model is Flexible
    2x2 of data models (flat vs. complex, fixed schema vs. schema-less; both axes labeled Flexibility) covering HBase, JSON, BSON, CSV, TSV, Parquet, Avro
    RDBMS/SQL-on-Hadoop table:
    Name      Gender  Age
    Michael   M       6
    Jennifer  F       3
    Apache Drill table:
    { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
    { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 20. (3) Drill Supports Schema Discovery On-The-Fly
    1. Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
       • Fixed schema
       • Leverage schema in centralized repository (Hive Metastore)
    2. Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
       • Fixed schema, evolving schema or schema-less
       • Leverage schema in centralized repository or self-describing data
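
    As a rough illustration of what discovering a schema on the fly can mean, the sketch below walks records as they are read and collects the field paths and value types it encounters. It is plain Python, not Drill code; `discover_schema` and the sample records (copied from slides 18 and 19, with quoting added) are illustrative assumptions and say nothing about how Drill implements this internally.

```python
def discover_schema(records):
    """Map each dotted field path to the set of Python type names observed."""
    schema = {}

    def visit(value, path):
        if isinstance(value, dict):
            # Descend into nested maps, extending the dotted path.
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        visit(record, "")
    return schema


sample = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

schema = discover_schema(sample)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

    No schema is declared anywhere: `district` and `preschool` simply show up as fields once a record containing them has been read.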
  • 21. Quick Tour: Self-Service Data Exploration with Apache Drill
  • 22. © 2014 MapR Technologies 22 • d
  • 23. Zero to Results in 2 Minutes (3 Commands)
    Install:
      $ tar xzf apache-drill.tar.gz
    Launch shell (embedded mode):
      $ apache-drill/bin/sqlline -u jdbc:drill:zk=local
    Query:
      0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category
                              FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
                              GROUP BY columns[1]
                              ORDER BY incidents DESC;
    Results:
      +------------+-----------------+
      | incidents  | category        |
      +------------+-----------------+
      | 8372       | LARCENY/THEFT   |
      | 4247       | OTHER OFFENSES  |
      | 3765       | NON-CRIMINAL    |
      | 2502       | ASSAULT         |
      ...
      35 rows selected (0.847 seconds)
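
    For readers who want to see what the query on this slide computes, the following Python sketch reproduces its logic (count rows per value of the second CSV field, largest count first) on a few made-up rows. The sample data and the function name are assumptions for illustration; only the second field, which the slide calls `columns[1]`, matters here.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the real file's other columns are not shown on the slide.
SAMPLE_CSV = """\
1001,LARCENY/THEFT,Monday
1002,ASSAULT,Monday
1003,LARCENY/THEFT,Tuesday
1004,NON-CRIMINAL,Tuesday
1005,LARCENY/THEFT,Wednesday
"""


def incidents_by_category(csv_text):
    """Count rows per value of the second field, largest count first."""
    reader = csv.reader(io.StringIO(csv_text))
    counts = Counter(row[1] for row in reader if row)
    return counts.most_common()  # like ORDER BY incidents DESC


result = incidents_by_category(SAMPLE_CSV)
for category, incidents in result:
    print(f"{incidents:>5}  {category}")
```

    As in the slide's query, no table definition or column names are needed; fields are addressed purely by position.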
  • 24. Data Source is in the Query
      SELECT timestamp, message
      FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
      WHERE errorLevel > 2
    dfs1: a storage engine instance (DFS, HBase, Hive Metastore/HCatalog)
    logs: a workspace (sub-directory, Hive database)
    `AppServerLogs/2014/Jan/p001.parquet`: a table (pathnames, HBase table, Hive table)
  • 25. Query Directory Trees
      -- Query file: How many errors per level in Jan 2014?
      SELECT errorLevel, count(*)
      FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
      GROUP BY errorLevel;

      -- Query directory sub-tree: How many errors per level?
      SELECT errorLevel, count(*)
      FROM dfs.logs.`/AppServerLogs`
      GROUP BY errorLevel;

      -- Query some partitions: How many errors per level by month from 2012?
      SELECT errorLevel, count(*)
      FROM dfs.logs.`/AppServerLogs`
      WHERE dirs[1] >= 2012
      GROUP BY errorLevel, dirs[2];
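
    The third query treats directory names as data. The Python sketch below mimics that idea on a handful of hypothetical (path, errorLevel) pairs. It assumes, as the slide's query suggests, that `dirs[1]` is the year directory and `dirs[2]` the month directory under `/AppServerLogs`; the exact indexing convention in Drill is not confirmed by the slide, and all names here are illustrative.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical (path, errorLevel) pairs standing in for rows read from files.
ROWS = [
    ("/AppServerLogs/2011/Dec/part0001.parquet", 3),
    ("/AppServerLogs/2012/Jan/part0001.parquet", 3),
    ("/AppServerLogs/2012/Jan/part0002.parquet", 5),
    ("/AppServerLogs/2014/Jan/part0001.parquet", 3),
    ("/AppServerLogs/2014/Feb/part0001.parquet", 3),
]


def dirs_of(path):
    """Directory components of a path, e.g. ['AppServerLogs', '2014', 'Jan']."""
    return list(PurePosixPath(path).parts[1:-1])  # drop leading '/' and file name


def errors_by_level_and_month(rows, from_year):
    counts = Counter()
    for path, error_level in rows:
        dirs = dirs_of(path)
        if int(dirs[1]) >= from_year:            # WHERE dirs[1] >= 2012
            counts[(error_level, dirs[2])] += 1  # GROUP BY errorLevel, dirs[2]
    return counts


result = errors_by_level_and_month(ROWS, 2012)
for (level, month), count in sorted(result.items()):
    print(level, month, count)
```

    Because the grouping key is the month directory alone, rows from January 2012 and January 2014 land in the same group, just as they would with the `GROUP BY` clause shown on the slide under this reading.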
  • 26. Works with HBase and Embedded Blobs
      -- Query an HBase table directly (no schemas)
      SELECT cf1.month, cf1.year
      FROM hbase.table1;

      -- Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
      SELECT profile.name, count(profile.children)
      FROM (
        SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
        FROM hbase.users
      )
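
    The second query decodes a JSON document stored as raw bytes inside a column and then addresses its fields. A minimal Python analogy follows; the `USERS` dictionary is a hypothetical stand-in for an HBase table, the sample profiles are made up, and reading `count(profile.children)` as the number of children per profile is an assumption.

```python
import json

# Hypothetical stand-in for an HBase table:
# row key -> column family -> column qualifier -> raw bytes.
USERS = {
    b"u1": {"cf1": {"profileBlob": b'{"name": "Michael", "children": ["Ann", "Bo"]}'}},
    b"u2": {"cf1": {"profileBlob": b'{"name": "Jennifer", "children": []}'}},
}


def profiles(table):
    """Decode the JSON held in cf1.profileBlob, in the spirit of CONVERT_FROM(..., 'json')."""
    for row in table.values():
        yield json.loads(row["cf1"]["profileBlob"].decode("utf-8"))


summary = [(p["name"], len(p["children"])) for p in profiles(USERS)]
print(summary)
```

    The store itself only sees opaque bytes; the structure becomes visible at query time, once the blob is decoded.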
  • 27. Combine Data Sources on the Fly
      -- Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.
      SELECT DISTINCT users.name, users.emails.work
      FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users
      WHERE logs.uid = users.id AND logs.errorLevel > 5;

      -- Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user
      SELECT users.name, count(*) as tweetCount
      FROM hive.social.tweets tweets, hbase.users users
      WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
      GROUP BY tweets.userId;
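
    The first query on this slide is an equi-join between two differently shaped sources followed by a filter and `DISTINCT`. The Python sketch below reproduces that logic on made-up in-memory data; the records, email addresses and function name are illustrative assumptions, and the hash-based lookup is just one simple way to express the join, not a statement about Drill's execution strategy.

```python
# Made-up log records (flat) and user profiles (nested), for illustration only.
LOGS = [
    {"uid": 1, "errorLevel": 7, "message": "disk full"},
    {"uid": 2, "errorLevel": 3, "message": "slow request"},
    {"uid": 1, "errorLevel": 9, "message": "node down"},
    {"uid": 3, "errorLevel": 6, "message": "timeout"},
]
PROFILES = [
    {"id": 1, "name": "Michael", "emails": {"work": "michael@example.com"}},
    {"id": 2, "name": "Jennifer", "emails": {"work": "jennifer@example.com"}},
    {"id": 3, "name": "Dana", "emails": {"work": "dana@example.com"}},
]


def users_with_errors(logs, profiles, min_level):
    """Name and work email of every user with a log entry above min_level."""
    by_id = {p["id"]: p for p in profiles}  # lookup side of the join
    matches = {
        (by_id[entry["uid"]]["name"], by_id[entry["uid"]]["emails"]["work"])
        for entry in logs
        if entry["errorLevel"] > min_level and entry["uid"] in by_id
    }
    return sorted(matches)  # the set gives DISTINCT; sorting is for stable output


result = users_with_errors(LOGS, PROFILES, 5)
for name, email in result:
    print(name, email)
```

    Michael appears once in the output even though two of his log entries match, which is the effect `DISTINCT` has in the slide's query.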
  • 28. Summary
    • Enable rapid data exploration and application development while reducing the burden on IT
    • Apache Drill beta coming soon
      – Email tshiran@mapr.com
    • Get involved
      – Download and play: http://incubator.apache.org/drill/
      – Ask questions: drill-user@incubator.apache.org
      – Contribute: http://github.com/apache/incubator-drill/
  • 29. Thank You. Tomer Shiran, VP Product Management, tshiran@mapr.com. Social handles: @mapr, maprtech, MapRTechnologies, maprtech, mapr-technologies