© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time
November 28, 2017
Agenda
• Real-time streaming data overview
• Streaming data services
• Benefits of streaming analytics
• Batch to streaming best practices
• How Amazon Flex moved from batch to streaming
What is batch processing?
"Execution of a series of jobs in a program on a computer without manual intervention" (Wikipedia)
• Data is collected over a period of time
• Process and analyze on a schedule
• Combine several processes to obtain final result
Most data is produced continuously
• Mobile apps
• Web clickstreams
• Application logs
• Metering records
• IoT sensors
• Smart buildings
The diminishing value of data
Recent data is highly valuable
• If you act on it in time
• "Perishable insights" (M. Gualtieri, Forrester)
Old + recent data is more valuable
• If you have the means to combine them
Processing real-time, streaming data
What are the key requirements?
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
Pipeline stages: Collect → Transform → Analyze → React → Persist
Amazon Kinesis makes it easy to work with real-time streaming data
Kinesis Streams
• For technical developers
• Collect and stream data for ordered, replayable, real-time processing
Kinesis Firehose
• For all developers and data scientists
• Easily load massive volumes of streaming data into Amazon S3, Redshift, and Elasticsearch
Kinesis Analytics
• For all developers and data scientists
• Easily analyze data streams using standard SQL queries
• Compute analytics in real time
Amazon Kinesis Streams
• Reliably ingest and durably store streaming data at low cost
• Build custom real-time applications to process streaming data
• Use your stream-processing framework of choice
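The deck itself has no code, but a minimal producer sketch shows the idea (it assumes the boto3 SDK, a configured AWS region, and a pre-created stream whose name here is hypothetical):

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

def publish_event(event: dict, partition_key: str) -> None:
    """Put a single telemetry event onto a Kinesis stream.

    Records with the same partition key land on the same shard,
    which preserves ordering for that key.
    """
    kinesis.put_record(
        StreamName="telemetry-stream",  # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )

publish_event({"type": "delivery_completed", "package_id": "123"}, partition_key="device-42")
```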
Amazon Kinesis Firehose
• Reliably ingest and deliver batched, compressed, and encrypted data to S3, Redshift, and Elasticsearch
• Point-and-click setup with zero administration and seamless elasticity
• Managed stream-processing consumer
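By contrast, writing to Firehose is fire-and-forget: the service handles buffering, compression, and delivery. A sketch assuming a pre-configured delivery stream (the name is hypothetical) with an S3 or Redshift destination:

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # region is an assumption

# Hypothetical delivery stream, assumed to be configured with an S3 or Redshift destination.
records = [
    {"Data": (json.dumps({"metric": "app_start", "value": 1}) + "\n").encode("utf-8")},
    {"Data": (json.dumps({"metric": "delivery_scan", "value": 1}) + "\n").encode("utf-8")},
]

response = firehose.put_record_batch(
    DeliveryStreamName="telemetry-to-s3",
    Records=records,
)
print("Failed records:", response["FailedPutCount"])
```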
Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream-processing applications that process data for real-time visualizations and alarms
Benefits of streaming analysis
Immediate results
• Real-time aggregations
• Filtering
• Anomaly detection
Reduced complexity
• Fewer scheduled jobs to manage
• Kinesis is a fully managed solution
Scalable
• Enables parallel processing
• Horizontally scales based on your ingest rate
Batch to streaming best practices
Migrate incrementally
• Don't boil the ocean
• Begin by streaming data in parallel to existing batch processes
• Persist streaming data into durable storage, like Amazon S3
• Add in streaming analysis results to replace batch analysis
[Diagram: data producer → application databases → ETL → data warehouse (existing batch path), in parallel with data producer → Amazon Kinesis → Amazon S3 (streaming data) → ETL → data warehouse]
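One way to read "stream in parallel to existing batch processes" is a dual write in the producer: leave the batch path untouched and add a Kinesis put next to it. A rough sketch, with a stand-in for the hypothetical legacy call:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def send_to_legacy_metrics_service(event: dict) -> None:
    """Placeholder for the existing call that feeds the nightly batch ETL."""
    ...

def record_event(event: dict) -> None:
    # Existing batch path stays untouched while the streaming path is validated.
    send_to_legacy_metrics_service(event)

    # New path: the same event also goes onto a Kinesis stream (name is hypothetical),
    # so streaming results can be compared against batch results before cutover.
    kinesis.put_record(
        StreamName="telemetry-stream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("device_id", "unknown")),
    )
```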
Batch to streaming best practices
Perform ITL rather than ETL
• ITL: Ingest-Transform-Load
• ETL: Extract-Transform-Load
• Transform data in near real time rather than in a scheduled job
• Enrich data in near real time
• Persist the transformed and/or enriched data
[Diagram: data producer → raw streaming data → Amazon Kinesis Firehose → AWS Lambda function, which transforms the data using an enrichment source → Amazon S3, holding both the raw data and the transformed and/or enriched data]
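The transform step can run in the Lambda function that Firehose invokes on each buffered batch. A minimal handler sketch following the Firehose data-transformation record contract; the enrichment lookup is hypothetical:

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda: decode, enrich, and re-encode each record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Hypothetical near-real-time enrichment (e.g., attach a region from a lookup).
        payload["region"] = lookup_region(payload.get("station_id"))

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # "Dropped" and "ProcessingFailed" are the other valid results
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}

def lookup_region(station_id):
    # Placeholder for a real enrichment source (DynamoDB table, in-memory map, etc.).
    return {"STN1": "us-west", "STN2": "us-east"}.get(station_id, "unknown")
```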
Batch to streaming best practices
Aggregate upon arrival
• Continuously write raw data to a persistent data store for archival and other analysis
• Aggregate in real time when the window size is < 1 hour
• Write aggregated data to a persistent data store for immediate value
[Diagram: data producer → raw streaming data → Amazon Kinesis Firehose → Amazon S3 (raw data); Amazon Kinesis Analytics aggregates the stream and writes the aggregated results back to Amazon S3 (aggregated data)]
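In the architecture above the aggregation is done by Kinesis Analytics in SQL; purely to illustrate the idea, here is a tiny tumbling-window counter written as a plain Kinesis consumer in Python (single shard, no checkpointing, hypothetical stream name):

```python
import json
import time
from collections import Counter

import boto3

kinesis = boto3.client("kinesis")
STREAM = "telemetry-stream"  # hypothetical

def tumbling_counts(window_seconds: int = 60):
    """Count events per event type in fixed windows, reading one shard from LATEST."""
    shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]

    window_end = time.time() + window_seconds
    counts = Counter()
    while True:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        iterator = resp["NextShardIterator"]
        for rec in resp["Records"]:
            counts[json.loads(rec["Data"])["type"]] += 1
        if time.time() >= window_end:
            yield dict(counts)       # persist to S3 or publish a metric here
            counts.clear()
            window_end += window_seconds
        time.sleep(1)                # stay under the per-shard read limits

for window in tumbling_counts():
    print("per-minute counts:", window)
```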
Batch to streaming example
Brandon Smith
• Senior software engineer
• Worked at Amazon for 12 years in Kindle, AWS, and now Last Mile Delivery
• Currently working on Amazon Flex
• Amazon delivery app (Android/iOS)
• Crowd-sourced model launched in
30+ U.S. cities
• Used by Amazon Logistics worldwide
• Deliveries for Amazon.com, Prime
Now, Amazon Fresh, restaurants,
grocery stores
• Millions of packages per year
The problem
• Collecting, processing, and storing telemetry data
• Telemetry data = remote measurements
• Includes metrics, crashes, logs, sensor data, clickstream data, etc.
The goal
• Understand what’s happening in the field
• Analyze all the data and make performance optimizations
• Focus our time on improving the app and the delivery flow
Use cases
Use case 1: Alarming
• We want to know within minutes if there are problems
• Example: If the delivery count drops below our expected/historical value,
we want to alarm
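If the delivery count is published as a custom CloudWatch metric (the namespace, metric name, and SNS topic below are hypothetical), a threshold alarm of this kind is a single API call:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when deliveries per 5-minute period fall below an expected floor.
cloudwatch.put_metric_alarm(
    AlarmName="flex-delivery-count-low",           # hypothetical
    Namespace="Flex/Telemetry",                    # hypothetical custom namespace
    MetricName="DeliveriesCompleted",              # hypothetical custom metric
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=100,                                 # expected/historical floor
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",                  # no data at all is also a problem
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
)
```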
Use case 2: Troubleshooting
• Logs and crashes published to Amazon CloudWatch Logs in near real time
• Can filter and search to troubleshoot issues
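Searching those logs from a script is equally direct; the log group name and filter pattern below are assumptions:

```python
import time
import boto3

logs = boto3.client("logs")

resp = logs.filter_log_events(
    logGroupName="/flex/app/crashes",            # hypothetical log group
    filterPattern='"FATAL EXCEPTION"',           # match crash stack headers
    startTime=int((time.time() - 3600) * 1000),  # last hour, epoch milliseconds
)
for event in resp["events"]:
    print(event["timestamp"], event["message"][:200])
```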
Use case 3: Dashboards
• We can write SQL, generate reports, and create visualizations
• But we really want real-time dashboards instead of daily reports
[Images: daily reports vs. real-time dashboards]
Use case 4: Releases
• Deploying new app versions and monitoring adoption in real time
• Release new code smoothly and with confidence
Use case 5: Sharing data
• Consumers get notifications of new data in real time
• Consumers can join their data with other data in the data lake
[Diagram: an S3 bucket feeding the shared data lake]
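One common way to deliver those real-time notifications is an S3 event notification on the shared bucket to an SNS topic that consumers subscribe to. A sketch with hypothetical bucket and topic names (the topic policy must already allow S3 to publish):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="flex-telemetry-data-lake",  # hypothetical shared bucket
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:new-telemetry-data",  # hypothetical
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "transformed/"}]}
                },
            }
        ]
    },
)
```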
Use case 6: Deeper analytics
• Look at the stream of data and the historical data
• Build ML models, create predictions, detect anomalies
How did we build it?
Getting from batch to streaming
• To solve our use cases, we had to incrementally improve our system
• We evolved from a batch-based system to a stream-based system
• Let’s walk through the iterations
Iteration 1: Use existing systems
• Collect metrics and send to an existing metrics service
• ETL jobs to load data into a big Oracle data warehouse
[Diagram: app → data collection → existing metrics service → ETL → data warehouse (DW)]
Iteration 1: Use existing systems
1. Batch process with 24-hour delay
2. Fixed, inflexible DB schema
3. Analysis difficult and slow via SQL
[Diagram: app → data collection → existing metrics service → ETL → data warehouse (DW)]
Iteration 2: Use AWS
• Collect metrics in the app using the Amazon Mobile Analytics SDK, which automatically loads data into Redshift
[Diagram: app → data collection → ETL system (CloudFormation) → Redshift]
Iteration 2: Use AWS
1. Batch process: delay reduced from 24 hours to 2 hours
2. Fixed, inflexible DB schema
3. Analysis difficult and slow via SQL
[Diagram: app → data collection → ETL system (CloudFormation) → Redshift]
Iteration 3: Automated DB schema
• Add a shared configuration that is used in the app and automatically updates the Redshift schema
[Diagram: app (with shared schema config) → data collection → ETL system (CloudFormation) → Redshift]
Iteration 3: Automated DB schema
1. Batch process with a 2-hour delay (down from 24 hours)
2. DB schema: now auto-updating instead of fixed and inflexible
3. Analysis difficult and slow via SQL
[Diagram: app (with shared schema config) → data collection → ETL system (CloudFormation) → Redshift]
Iteration 4: Use Streams
• Introduce a Kinesis stream and Kinesis Firehose to publish to Redshift
• Partition data by date to simplify data retention policies
[Diagram: app (with schema config) → data collection via Pinpoint → Kinesis stream → Kinesis Firehose → Redshift]
Iteration 4: Use Streams
1. Now a streaming process, with a delay of a couple of minutes (previously 24 hours, then 2 hours)
2. Auto-updating DB schema
3. Analysis difficult and slow via SQL
[Diagram: app (with schema config) → data collection via Pinpoint → Kinesis stream → Kinesis Firehose → Redshift]
Iteration 5: Generic message types
• Use generic message types
• Publish the data to S3, Redshift, and Elasticsearch
[Diagram: app publishing generic messages to S3, Redshift, and Elasticsearch]
Iteration 5
[Diagram: app → data collection (ProtoBuf messages) → consumer Redshift clusters, Elasticsearch, and consumer Lambdas, feeding SQL reports and dashboards]
Iteration 5: Generic message types
1. Streaming process with a delay of a few seconds (previously 24 hours, 2 hours, then a couple of minutes)
2. Auto-updating DB schema and generic message types (no longer fixed and inflexible)
3. Analysis is now flexible by processing the message payload, instead of difficult and slow via SQL
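The team's generic message types are defined with ProtoBuf; purely as an illustration of the envelope idea (not the actual schema), a JSON version might carry a type tag, a schema version, and an opaque payload, so consumers can route and parse messages without any change to the pipeline itself:

```python
import json
import time
import uuid

def make_envelope(message_type: str, payload: dict, schema_version: int = 1) -> bytes:
    """Wrap any telemetry payload in a generic envelope; the pipeline reads only envelope fields."""
    envelope = {
        "message_id": str(uuid.uuid4()),
        "message_type": message_type,        # e.g., "metric", "log", "location_ping"
        "schema_version": schema_version,
        "emitted_at": int(time.time() * 1000),
        "payload": payload,                   # opaque to the pipeline, parsed by consumers
    }
    return json.dumps(envelope).encode("utf-8")

# A consumer Lambda can dispatch on message_type without knowing every payload shape.
HANDLERS = {
    "metric": lambda p: print("metric:", p),
    "log": lambda p: print("log line:", p.get("line")),
}

def consume(raw: bytes) -> None:
    env = json.loads(raw)
    HANDLERS.get(env["message_type"], lambda p: None)(env["payload"])

consume(make_envelope("metric", {"name": "app_start", "value": 1}))
```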
Data flow
[Diagram: app → Elasticsearch, consumer Redshift clusters, and consumer Lambdas, feeding SQL reports and dashboards]
Future improvements
Some ideas to make the system even better:
1. Use Kinesis Analytics to query the real-time data stream
2. Use Amazon Athena to query data directly from S3 (see the sketch below)
3. Use Amazon AI services to do deeper data analysis
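For item 2, querying the S3 data directly is one Athena call once a table has been defined over the stream's S3 prefix; the database, table, column names, and output location below are assumptions:

```python
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="""
        SELECT message_type, count(*) AS events
        FROM telemetry                      -- hypothetical table over the S3 data
        WHERE dt = '2017-11-28'             -- hypothetical date-partition column
        GROUP BY message_type
    """,
    QueryExecutionContext={"Database": "flex_telemetry"},                 # hypothetical
    ResultConfiguration={"OutputLocation": "s3://flex-athena-results/"},  # hypothetical
)
print("QueryExecutionId:", resp["QueryExecutionId"])
```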
Summary
Did we solve our use cases?
1. Real-time metrics and alarming
2. Real-time dashboards
3. Real-time logs and crash troubleshooting
4. Monitoring new releases
5. Sharing data with other teams
6. Deeper analytics
Benefits of Streaming
1. Agility: real-time data means your business can react more quickly
2. Flexibility: generic message types give you flexible schemas, so your system can handle multiple data types and future use cases
3. Shareability: streams let you multiplex and share your data easily with your consumers
4. Extensibility: processing streams of data lets you write it to multiple data storage systems, which enables a variety of analytics tools
Thank you!