Sentry with a big data suite

Dear Sentry Community,

I’m doing a POC of Sentry with our big data suite, which includes Airflow, Hive, Redshift, and Spark (Scala and PySpark). What I’m aiming for is to 1) gather all logs in one place that is searchable, insightful, and real-time, 2) get the context of an error when an incident happens, and 3) build offline analytics on the logging from our big data jobs. We are running on AWS.

A few questions, covering both integrations and SDKs:

  1. For Airflow, it works like a charm when I test it: it captures errors with rich breadcrumbs. When the operator is Hive/Spark, Airflow usually displays and gathers the compute logs. Does the Sentry integration with Airflow also capture errors from those logs, e.g. why a query failed or a runtime error in Spark?

  2. For Spark, I’m not able to make it work (local standalone Spark instance). Is there any code repo or example of usage? We have PySpark and Scala Spark, but mostly Scala Spark. Do you recommend the Log4j appender or the Spark integration from the SDK instructions? I’ve pasted what I tried for PySpark below this list.

  3. For Hive, I’m not sure how to gather the logs. Can I get some direction for testing, or an example I can learn from?

  4. For YARN or other resource managers, any suggestions?

  5. For Redshift, I think this is going to be complex, as I don’t think there is an SDK ready for it, and it doesn’t use Log4j. I’ve sketched below what I would otherwise do from the client side.

  6. All of the above logs are available in a file store (S3). Does Sentry support any integration to ingest S3 logs into the Sentry server, and is that even recommended? My understanding is that Sentry is not a general logging solution but an error-tracking tool with rich context. I’ve added a rough sketch below of what we would otherwise script ourselves.
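
To make question 2 concrete, here is roughly what I tried on the local standalone instance with PySpark. The DSN and app name are placeholders, and I’m assuming the `SparkIntegration` from the Python SDK is the right entry point on the driver:

```python
import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration
from pyspark.sql import SparkSession

# Initialise Sentry on the driver before the SparkSession exists,
# so the integration can attach its Spark listener.
sentry_sdk.init(
    dsn="https://<key>@<org>.ingest.sentry.io/<project>",  # placeholder DSN
    integrations=[SparkIntegration()],
)

spark = SparkSession.builder.appName("sentry-poc").getOrCreate()

# A deliberately failing job, just to check that the error shows up in Sentry.
try:
    spark.sparkContext.parallelize(range(10)).map(lambda x: x / 0).count()
except Exception:
    # The integration should capture this on its own; the explicit call is
    # only to be sure during the POC.
    sentry_sdk.capture_exception()
```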
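For question 5, since there is no Redshift SDK, this is a minimal sketch of what I would otherwise do from the client side in Python: wrap each query with psycopg2 and report failures myself. The connection details are made up:

```python
import psycopg2
import sentry_sdk

sentry_sdk.init(dsn="https://<key>@<org>.ingest.sentry.io/<project>")  # placeholder DSN

# Hypothetical connection details, for illustration only.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="poc_user",
    password="...",
)

def run_query(sql: str):
    """Run a Redshift query and report failures to Sentry with the SQL attached."""
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    except Exception:
        sentry_sdk.set_context("redshift", {"sql": sql})
        sentry_sdk.capture_exception()
        raise
```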
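And for question 6, if there is no built-in ingestion, this is roughly the kind of script we would end up writing ourselves: pull log objects from S3 with boto3 and forward the error lines as Sentry events. The bucket and prefix are hypothetical, and I would rather avoid this if there is a better pattern:

```python
import boto3
import sentry_sdk

sentry_sdk.init(dsn="https://<key>@<org>.ingest.sentry.io/<project>")  # placeholder DSN

s3 = boto3.client("s3")

def forward_errors(bucket: str, prefix: str) -> None:
    """Scan log objects under an S3 prefix and report ERROR lines to Sentry."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for line in body.decode("utf-8", errors="replace").splitlines():
                if "ERROR" in line:
                    # One Sentry event per error line, tagged with the source object.
                    with sentry_sdk.push_scope() as scope:
                        scope.set_tag("s3_key", obj["Key"])
                        sentry_sdk.capture_message(line, level="error")

# Hypothetical bucket/prefix, for illustration only.
forward_errors("my-bigdata-logs", "emr/spark/")
```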

Thanks a lot, I really appreciate any direction from the Sentry engineers and the community :slight_smile:
Martin