Learn to scale UP

the right way

Implement powerful pipelines and reach scale with:
Cloud, SRE, Analytics, Machine Learning, Microservices
    Data Analytics

    know more about your business using key metrics

    + analytics pipeline for metric data on GCP

    + lambda / stream processing for log data

    + quickest setup for a modern BI pipeline

    + scalable batch processing for event data

    CI / CD

    increase productivity of code delivery with automated workflows

    + painless setup using circleci and ansible

    + scalable kubernetes on gcp with cloud build

    + use aws codepipeline and codedeploy to deploy a traditional rails application

    Site Reliability

    maintain infrastructure and maximize availability of workloads

    + monitoring / alerting

    + automate aws with ssm, config, cloudwatch

    + automation with opsworks (chef) and ansible

    + kubernetes

    + logging with elk stack

    + high availability on aws

    Cloud Infrastructure

    + high availability in aws

    + security in aws

    + networking in aws

    + storage in aws

    Security

    + distribution / decomposition

    + communication

    + monitoring

    + data management

    Microservices

    + kubernetes managed cluster on AWS EKS

    + nomad with less complex container cluster

    + easier Kubernetes management with GKE Autopilot

    + easier Kubernetes management with AWS Fargate

    Machine Learning

    + distribution / decomposition

    + communication

    + monitoring

    + data management

What Are We Building?


Data Analytics Pipelines

know more about your business using key metrics

Big data analytics examines large amounts of data to uncover hidden patterns, correlations and other insights. With today’s technology, it’s possible to analyze your data and get answers from it almost immediately – an effort that’s slower and less efficient with more traditional business intelligence solutions.
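As a toy illustration of "key metrics", a metric like daily active users can be computed directly from raw event records. The records and field names below are hypothetical stand-ins for what a tracking pipeline might emit:

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw event records, as a tracking pipeline might emit them.
events = [
    {"user_id": "u1", "ts": date(2021, 3, 1)},
    {"user_id": "u2", "ts": date(2021, 3, 1)},
    {"user_id": "u1", "ts": date(2021, 3, 1)},  # same user, same day: counted once
    {"user_id": "u1", "ts": date(2021, 3, 2)},
]

def daily_active_users(events):
    """Count distinct users per day -- a common key business metric."""
    users_by_day = defaultdict(set)
    for e in events:
        users_by_day[e["ts"]].add(e["user_id"])
    return {day: len(users) for day, users in users_by_day.items()}

print(daily_active_users(events))
# {datetime.date(2021, 3, 1): 2, datetime.date(2021, 3, 2): 1}
```

At scale the same grouping and distinct-count logic runs in Spark or a warehouse instead of in-process, but the metric definition is identical.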

Tools

Apache Storm: Distributed and fault-tolerant realtime computation

big data, stream processing, analytics, distributed, dag

Apache Spark: Fast and general engine for large-scale data processing

big data, stream processing, analytics, batch processing

Apache Kafka: Distributed event streaming platform

big data, stream processing, streaming

Apache Airflow: A platform to programmatically author, schedule and monitor workflows

big data, analytics, data engineering, etl, pipelines

Apache Flink: Fast and reliable large-scale data processing engine

big data, stream processing, analytics, distributed

Presto: Distributed SQL Query Engine for Big Data

big data, sql, analytics

Apache Hudi: Ingests & manages storage of large analytical datasets over DFS

big data, data lakes

Delta Lake: Reliable Data Lakes at Scale

big data, data lakes

Great Expectations: Always know what to expect from your data

Apache NiFi: A reliable system to process and distribute data

big data, message queue, etl, analytics

Apache Hive: Data warehouse software for reading, writing, and managing large datasets

big data

Resources

article . architecture, infrastructure, machine learning, sql, spark . 10/15/2020

by Matt Bornstein / Martin Casado / Jennifer Li from a16z.com

Five years ago, if you were building a system, it was a result of the code you wrote. Now, it's built around the data that is fed into that system, and a new class of tools and technologies has emerged to process data for both analytics and operational AI/ML.


article . data engineering, architecture, data lake, optimization, pipeline . 03/01/2020

by Satish Chandra Gupta from satishchandragupta.com

For deploying big-data analytics, data science, and machine learning (ML) applications in the real world, analytics tuning and model training is only around 25% of the work. Approximately 50% of the effort goes into making data ready for analytics and ML. This article gives an introduction to the data pipeline and an overview of big data architecture alternatives.


100 min read

documentation . spark, analytics, data, processing, etl

from databricks.com

The official page for common terms used in the Spark ecosystem


tutorial . spark, r, data, notebook, analytics . 01/02/2017

by Max Woolf from minimaxir.com

An example notebook using Spark and R to process and analyze Product Reviews on Amazon


tutorial . logs, spark, data, databricks, rdd . 04/21/2015

by Ion Stoica / Vida Ha from databricks.com

Databricks provides a powerful platform to process, analyze, and visualize big and small data in one place. In this blog, we will illustrate how to analyze access logs of an Apache HTTP web server using Notebooks. Notebooks allow users to write and run arbitrary Apache Spark code and interactively visualize the results. Currently, notebooks support three languages: Scala, Python, and SQL. In this blog, we will be using Python for illustration.


90 min read
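The access-log analysis described above reduces to parsing each log line into fields and aggregating them. A minimal pure-Python sketch of that idea (no Spark required; the regex covers the Apache Common Log Format only, and the sample lines are made up):

```python
import re
from collections import Counter

# Apache Common Log Format, e.g.:
# 127.0.0.1 - - [21/Apr/2015:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Parse one access-log line into a dict of fields, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

logs = [
    '127.0.0.1 - - [21/Apr/2015:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.5 - - [21/Apr/2015:10:00:01 +0000] "GET /missing HTTP/1.1" 404 209',
    '127.0.0.1 - - [21/Apr/2015:10:00:02 +0000] "POST /api HTTP/1.1" 200 53',
]

records = [r for r in (parse_line(line) for line in logs) if r]
status_counts = Counter(r["status"] for r in records)
print(status_counts)  # Counter({'200': 2, '404': 1})
```

In the Databricks notebook version, the same parse function is mapped over an RDD or DataFrame of log lines, so the per-line logic carries over unchanged.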

course . streaming, aws, dataframes, rdd, sql, spark

from sparkbyexamples.com

In this Apache Spark tutorial, you will learn Spark with Scala code examples; every example shown here is available in the Spark Examples GitHub project for reference. All examples are basic, simple, and easy to practice for beginners enthusiastic to learn Spark, and each was tested in our development environment.


Use Cases

monitor company growth

optimize operational goals within teams by tracking key metrics

identify customer churn

turn critical user events from funnels into actionable insights

risk modeling

prevent costly errors and support investment decisions

security analysis

detect potential attacks or suspicious behavior by analysing logs
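To make the funnel use case concrete, here is a small sketch that counts how many users reach each step of a hypothetical funnel; the drop-off between steps is where churn shows up (step names and data are invented for illustration):

```python
# Hypothetical funnel steps, in order, and the events each user fired.
FUNNEL = ["visit", "signup", "purchase"]

user_events = {
    "u1": {"visit", "signup", "purchase"},
    "u2": {"visit", "signup"},
    "u3": {"visit"},
    "u4": {"visit", "signup"},
}

def funnel_counts(funnel, user_events):
    """Count users who reached each step, having completed all earlier steps."""
    counts = []
    for i, step in enumerate(funnel):
        needed = set(funnel[: i + 1])
        counts.append(sum(1 for ev in user_events.values() if needed <= ev))
    return counts

counts = funnel_counts(FUNNEL, user_events)
print(counts)  # [4, 3, 1]
# Drop-off between adjacent steps shows where users churn:
# visit -> signup loses 1 user, signup -> purchase loses 2.
```

The same subset logic scales to a GROUP BY over event tables in Spark or a warehouse; the point is that "actionable insight" here is just the delta between adjacent counts.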

Diagrams

Spark Pipeline

nifi log collector -> kafka queue -> spark processing / hive and hdfs -> tableau

AWS Pipeline

fluentd log collector -> kinesis stream -> emr spark processing -> redshift -> tableau
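Both diagrams follow the same shape: collector -> queue -> processor -> sink. A plain-Python sketch of that shape, using an in-memory queue as a stand-in for kafka/kinesis (every component here is a toy stand-in, not the real tool):

```python
import queue

def collect(lines, q):
    """nifi/fluentd role: ship raw records into the queue."""
    for line in lines:
        q.put(line)
    q.put(None)  # sentinel marking end of stream

def process(q, sink):
    """spark/emr role: transform each record and write it to the sink."""
    while (item := q.get()) is not None:
        sink.append(item.upper())

raw = ["event a", "event b"]
q = queue.Queue()   # kafka/kinesis role: decouple producer from consumer
sink = []           # hive/redshift role: durable store queried by tableau
collect(raw, q)
process(q, sink)
print(sink)  # ['EVENT A', 'EVENT B']
```

The value of the queue in the middle is that the collector and processor can run, fail, and scale independently; swapping in the real components changes the plumbing, not the shape.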

Guides

design a robust analytics system with batch processing using airflow, singer, spark, bigquery and tableau

facilitate real-time data flow using flink, kafka, s3 and presto to react to new events instantly

get most of what you need in a minimal yet powerful analytics system using only fivetran, snowflake and mode

ship, process and store large amounts of events using snowplow, fluentd, airflow, spark and redshift

Glossary

copyright upslug.com @ seattle